gme-Qwen2-VL-2B-Instruct

Frameworks: Pytorch
License: Apache License 2.0
Tasks: multi-modal-embedding
Languages: zh, en
Base model: Qwen/Qwen2-VL-2B-Instruct
Metrics: accuracy

GME Logo

GME: General Multimodal Embedding

GME-Qwen2-VL-2B

The GME-Qwen2VL series of unified multimodal embedding models is built on the Qwen2-VL multimodal large language models (MLLMs). GME models accept three types of input: text, images, and image-text pairs, and map all of them into a shared vector representation with strong retrieval performance.

Key features of the GME models:

  • Unified multimodal representation: GME models can process both single-modal and combined-modal inputs and produce a unified vector representation. This enables a wide range of retrieval scenarios (Any2Any search), such as text-to-text, text-to-image, and image-to-image retrieval; a minimal retrieval sketch follows this list.
  • High performance: Achieves state-of-the-art (SOTA) results on our Universal Multimodal Retrieval Benchmark (UMRB) and performs strongly on the Massive Text Embedding Benchmark (MTEB).
  • Dynamic image resolution: Benefiting from Qwen2-VL, GME models support image inputs at dynamic resolutions.
  • Strong visual retrieval performance: Thanks to the Qwen2-VL model family and our training data, the models excel at visual document retrieval tasks such as table and PDF retrieval. This capability is particularly important for complex document-understanding scenarios, for example multimodal retrieval-augmented generation (RAG) applications over academic papers.
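The shared embedding space is easiest to see in code. The following is a minimal Any2Any retrieval sketch, assuming the GmeQwen2VL helper introduced in the Usage section below; the model path and image URL are illustrative placeholders, and the embeddings are L2-normalized explicitly before scoring.

import torch
import torch.nn.functional as F
from gme_inference import GmeQwen2VL  # helper script, see the Usage section below

gme = GmeQwen2VL("gme-Qwen2-VL-2B-Instruct")  # illustrative local path or model id

# A mixed corpus: one text document and one image document.
doc_text_emb = gme.get_text_embeddings(
    texts=["The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc."]
)
doc_image_emb = gme.get_image_embeddings(
    images=["https://example.com/cybertruck.jpg"],  # placeholder URL
    is_query=False,
)
docs = F.normalize(torch.cat([doc_text_emb, doc_image_emb]), dim=-1)  # (2, dim)

# A single text query, scored against documents of both modalities.
query = F.normalize(
    gme.get_text_embeddings(
        texts=["What kind of car is this?"],
        instruction="Find a document that matches the given question.",
    ),
    dim=-1,
)
scores = query @ docs.T                # cosine similarities in the shared space
print(scores, scores.argmax(dim=-1))   # best-matching document, regardless of modality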

Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Model List

| Models | Model Size | Max Seq. Length | Dimension | MTEB-en | MTEB-zh | UMRB |
|-----------------|-------|-------|------|-------|-------|-------|
| gme-Qwen2-VL-2B | 2.21B | 32768 | 1536 | 65.27 | 66.92 | 64.45 |
| gme-Qwen2-VL-7B | 8.29B | 32768 | 3584 | 67.48 | 69.73 | 67.44 |

Usage

Use with custom code

# You can find the script gme_inference.py in https://modelscope.cn/models/iic/gme-Qwen2-VL-7B-Instruct/file/view/master?fileName=gme_inference.py
from gme_inference import GmeQwen2VL

texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
    'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
    'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]

gme = GmeQwen2VL("gme-Qwen2-VL-7B-Instruct")

# Single-modal embedding
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
## tensor([0.2281, 0.6001], dtype=torch.float16)

# How to set embedding instruction
e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
# If is_query=False, we always use the default instruction.
e_corpus = gme.get_image_embeddings(images=images, is_query=False)
print((e_query * e_corpus).sum(-1))
## tensor([0.2433, 0.7051], dtype=torch.float16)

# Fused-modal embedding
e_fused = gme.get_fused_embeddings(texts=texts, images=images)
print((e_fused[0] * e_fused[1]).sum())
## tensor(0.6108, dtype=torch.float16)
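The same embeddings can drive retrieval over a larger corpus, for example ranking the pages of a PDF in a multimodal RAG pipeline (the T→VD setting). The sketch below reuses the gme model from the snippet above; the page image paths and the query are illustrative placeholders, and the embeddings are L2-normalized before a top-k search.

import torch
import torch.nn.functional as F

# Embed a small corpus of document-page images (placeholder paths or URLs).
page_images = ["page_001.png", "page_002.png", "page_003.png"]
page_emb = F.normalize(gme.get_image_embeddings(images=page_images, is_query=False), dim=-1)

# Embed a question as the query and retrieve the top-k pages.
queries = ["Which page shows the quarterly revenue table?"]  # illustrative question
query_emb = F.normalize(
    gme.get_text_embeddings(texts=queries, instruction="Find a document page that answers the question."),
    dim=-1,
)
scores = query_emb @ page_emb.T        # (num_queries, num_pages)
topk = torch.topk(scores, k=2, dim=-1)
print(topk.values)    # similarity scores of the best pages
print(topk.indices)   # indices into page_images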

UMRB Evaluation

| Model | Size | T→T (16) | I→I (1) | T→I (4) | T→VD (10) | I→T (4) | T→IT (2) | IT→T (5) | IT→I (2) | IT→IT (3) | Avg. (47) |
|-------|------|----------|---------|---------|-----------|---------|----------|----------|----------|-----------|-----------|
| VISTA | 0.2B | 55.15 | 31.98 | 32.88 | 10.12 | 31.23 | 45.81 | 53.32 | 8.97 | 26.26 | 37.32 |
| CLIP-SF | 0.4B | 39.75 | 31.42 | 59.05 | 24.09 | 62.95 | 66.41 | 53.32 | 34.9 | 55.65 | 43.66 |
| One-Peace | 4B | 43.54 | 31.27 | 61.38 | 42.9 | 65.59 | 42.72 | 28.29 | 6.73 | 23.41 | 42.01 |
| DSE | 4.2B | 48.94 | 27.92 | 40.75 | 78.21 | 52.54 | 49.62 | 35.44 | 8.36 | 40.18 | 50.04 |
| E5-V | 8.4B | 52.41 | 27.36 | 46.56 | 41.22 | 47.95 | 54.13 | 32.9 | 23.17 | 7.23 | 42.52 |
| GME-Qwen2-VL-2B | 2.2B | 55.93 | 29.86 | 57.36 | 87.84 | 61.93 | 76.47 | 64.58 | 37.02 | 66.47 | 64.45 |
| GME-Qwen2-VL-7B | 8.3B | 58.19 | 31.89 | 61.35 | 89.92 | 65.83 | 80.94 | 66.18 | 42.56 | 73.62 | 67.44 |

Single-modal tasks: T→T, I→I; cross-modal tasks: T→I, T→VD, I→T; fused-modal tasks: T→IT, IT→T, IT→I, IT→IT. Numbers in parentheses are the number of subtasks in each group.

The English tab of the MTEB Leaderboard shows the text embedding performance of our models.

More detailed experimental results can be found in the paper.

Limitations

  • Single image input: For training efficiency, we limit the number of visual tokens per image to 1024 (a sketch of how this budget maps to image pixels follows below). Due to the lack of relevant data, our models and evaluations also cover only single-image inputs.
  • English-only training: Our models are trained only on English data. Although the underlying Qwen2-VL models are multilingual, multilingual multimodal embedding performance is not guaranteed.

We plan to extend to multi-image inputs, image-text interleaved data, and multilingual data in future releases.
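The 1024-token budget above follows from how Qwen2-VL tokenizes images: each visual token covers roughly a 28×28-pixel region, so the budget can be approximated by capping max_pixels on the image processor. Below is a hedged sketch using the Hugging Face AutoProcessor directly; the checkpoint path is illustrative, and whether the gme_inference wrapper exposes equivalent options is not covered here.

from transformers import AutoProcessor

# Each Qwen2-VL visual token corresponds to roughly a 28x28-pixel region,
# so ~1024 visual tokens per image can be approximated by capping max_pixels.
max_visual_tokens = 1024
processor = AutoProcessor.from_pretrained(
    "gme-Qwen2-VL-2B-Instruct",             # illustrative local path or model id
    min_pixels=256 * 28 * 28,
    max_pixels=max_visual_tokens * 28 * 28,
)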

Alibaba Cloud API Service

In addition to the open-source GME series, the GME models are also available as commercial API services on Alibaba Cloud.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

Citation

If you find our paper or models helpful, please consider citing:

@misc{zhang2024gme,
      title={GME: Improving Universal Multimodal Retrieval by Multimodal LLMs}, 
      author={Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min},
      year={2024},
      eprint={2412.16855},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={http://arxiv.org/abs/2412.16855}, 
}