---
license: other
license_name: deepseek
license_link: https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL
pipeline_tag: image-text-to-text
library_name: transformers
---

## 1. Introduction

DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) vision-language models that significantly improves upon its predecessor, DeepSeek-VL. It demonstrates superior capabilities across a range of tasks, including visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. The series comprises three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively.
DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.


[DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding](https://github.com/deepseek-ai/DeepSeek-VL2/blob/main/images/vl2_teaser.jpeg)

[**GitHub Repository**](https://github.com/deepseek-ai/DeepSeek-VL2)


Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu**, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan*** (* Equal Contribution, ** Project Lead, *** Corresponding author)

![](https://github.com/deepseek-ai/DeepSeek-VL2/tree/main/images/vl2_teaser.jpeg)


## 2. Model Summary

DeepSeek-VL2-Tiny is built on DeepSeekMoE-3B and has 1.0B activated parameters in total.


## 3. Quick Start

### Installation

In a `Python >= 3.8` environment, install the necessary dependencies with the `pip install` commands listed as comments at the top of the inference example below.
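
Independently of the install step, the Python requirement can be verified up front. The snippet below is a minimal sketch using only the standard library; the helper name `python_is_supported` is hypothetical and not part of the DeepSeek-VL2 package.

```python
import sys

# Minimum interpreter version stated in this README.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=None):
    """Return True if the given (or currently running) interpreter meets MIN_PYTHON.

    `version_info` is any sequence whose first two items are (major, minor);
    it defaults to the running interpreter's sys.version_info.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python >= 3.8 is required, found:", sys.version.split()[0])
```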

### Simple Inference Example

```python
# pip install git+https://github.com/deepseek-ai/DeepSeek-VL2.git
# pip install "transformers<4.42"
import torch
from modelscope import AutoModelForCausalLM, snapshot_download

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images


# specify the path to the model
model_path = snapshot_download("deepseek-ai/deepseek-vl2-tiny")
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# single-image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
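
The conversation structure used in the example above can be generated for other images and phrases. The following is a minimal sketch that only reproduces the dictionary layout and the `<|ref|>...<|/ref|>` tagging shown in the example; the helper name `build_grounding_conversation` is hypothetical and is not provided by the DeepSeek-VL2 package.

```python
def build_grounding_conversation(image_path, phrase):
    """Build a single-image visual grounding conversation.

    Mirrors the structure of the example above: one user turn that wraps
    the target phrase in <|ref|>...<|/ref|> tags, followed by an empty
    assistant turn for the model to complete.
    """
    return [
        {
            "role": "<|User|>",
            "content": f"<image>\n<|ref|>{phrase}<|/ref|>.",
            "images": [image_path],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]


conversation = build_grounding_conversation(
    "./images/visual_grounding.jpeg", "The giraffe at the back."
)
print(conversation[0]["content"])
```

The resulting list can be passed to `load_pil_images` and the processor in the same way as the hand-written `conversation` in the example.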

### Gradio Demo (TODO)


## 4. License

This code repository is licensed under the [MIT License](./LICENSE-CODE). Use of the DeepSeek-VL2 models is subject to the [DeepSeek Model License](./LICENSE-MODEL). The DeepSeek-VL2 series supports commercial use.

## 5. Citation

```
@misc{wu2024deepseekvl2,
      title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding},
      author={Wu, Zhiyu and Chen, Xiaokang and Pan, Zizheng and Liu, Xingchao and Liu, Wen and Dai, Damai and Gao, Huazuo and Ma, Yiyang and Wu, Chengyue and Wang, Bingxuan and Xie, Zhenda and Wu, Yu and Hu, Kai and Wang, Jiawei and Sun, Yaofeng and Li, Yukun and Piao, Yishi and Guan, Kang and Liu, Aixin and Xie, Xin and You, Yuxiang and Dong, Kai and Yu, Xingkai and Zhang, Haowei and Zhao, Liang and Wang, Yisong and Ruan, Chong},
      year={2024},
}
```

## 6. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).