gme-Qwen2-VL-2B-Instruct

Frameworks: Pytorch
License: Apache License 2.0
Tasks: multi-modal-embedding
Languages: zh, en
Base model: Qwen/Qwen2-VL-2B-Instruct
Metrics: accuracy

GME Logo

GME: General Multimodal Embedding

GME-Qwen2-VL-2B

The GME-Qwen2VL series of unified multimodal embedding models is built on the Qwen2-VL multimodal large language models (MLLMs). GME models accept three types of input: text, images, and image-text pairs, and map all of them into a shared vector representation with strong retrieval performance.

Key features of the GME models:

  • Unified multimodal representation: GME models can process both single-modal and combined-modal inputs and produce a unified vector representation. This enables a wide range of retrieval scenarios (Any2Any search), such as text-to-text, text-to-image, and image-to-image retrieval; a minimal retrieval sketch follows this list.
  • High performance: Achieves state-of-the-art (SOTA) results on our Universal Multimodal Retrieval Benchmark (UMRB) and performs strongly on the Massive Text Embedding Benchmark (MTEB).
  • Dynamic image resolution: Benefiting from Qwen2-VL, GME models support image inputs at dynamic resolutions.
  • Strong visual retrieval performance: Thanks to the Qwen2-VL model family and our training data, the models excel at visual document retrieval tasks such as table and PDF retrieval. This capability is particularly important for complex document-understanding scenarios, for example multimodal retrieval-augmented generation (RAG) applications over academic papers.
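The shared embedding space is easiest to see in code. The following is a minimal Any2Any retrieval sketch, assuming the GmeQwen2VL helper introduced in the Usage section below; the model path and image URL are illustrative placeholders, and the embeddings are L2-normalized explicitly before scoring.

import torch
import torch.nn.functional as F
from gme_inference import GmeQwen2VL  # helper script, see the Usage section below

gme = GmeQwen2VL("gme-Qwen2-VL-2B-Instruct")  # illustrative local path or model id

# A mixed corpus: one text document and one image document.
doc_text_emb = gme.get_text_embeddings(
    texts=["The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc."]
)
doc_image_emb = gme.get_image_embeddings(
    images=["https://example.com/cybertruck.jpg"],  # placeholder URL
    is_query=False,
)
docs = F.normalize(torch.cat([doc_text_emb, doc_image_emb]), dim=-1)  # (2, dim)

# A single text query, scored against documents of both modalities.
query = F.normalize(
    gme.get_text_embeddings(
        texts=["What kind of car is this?"],
        instruction="Find a document that matches the given question.",
    ),
    dim=-1,
)
scores = query @ docs.T                # cosine similarities in the shared space
print(scores, scores.argmax(dim=-1))   # best-matching document, regardless of modality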

Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Model List

| Models | Model Size | Max Seq. Length | Dimension | MTEB-en | MTEB-zh | UMRB |
|-----------------|-------|-------|------|-------|-------|-------|
| gme-Qwen2-VL-2B | 2.21B | 32768 | 1536 | 65.27 | 66.92 | 64.45 |
| gme-Qwen2-VL-7B | 8.29B | 32768 | 3584 | 67.48 | 69.73 | 67.44 |

Usage

Use with custom code

# You can find the script gme_inference.py in https://modelscope.cn/models/iic/gme-Qwen2-VL-7B-Instruct/file/view/master?fileName=gme_inference.py
from gme_inference import GmeQwen2VL

texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
    'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
    'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]

gme = GmeQwen2VL("gme-Qwen2-VL-7B-Instruct")

# Single-modal embedding
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
## tensor([0.2281, 0.6001], dtype=torch.float16)

# How to set embedding instruction
e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
# If is_query=False, we always use the default instruction.
e_corpus = gme.get_image_embeddings(images=images, is_query=False)
print((e_query * e_corpus).sum(-1))
## tensor([0.2433, 0.7051], dtype=torch.float16)

# Fused-modal embedding
e_fused = gme.get_fused_embeddings(texts=texts, images=images)
print((e_fused[0] * e_fused[1]).sum())
## tensor(0.6108, dtype=torch.float16)
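The same embeddings can drive retrieval over a larger corpus, for example ranking the pages of a PDF in a multimodal RAG pipeline (the T→VD setting). The sketch below reuses the gme model from the snippet above; the page image paths and the query are illustrative placeholders, and the embeddings are L2-normalized before a top-k search.

import torch
import torch.nn.functional as F

# Embed a small corpus of document-page images (placeholder paths or URLs).
page_images = ["page_001.png", "page_002.png", "page_003.png"]
page_emb = F.normalize(gme.get_image_embeddings(images=page_images, is_query=False), dim=-1)

# Embed a question as the query and retrieve the top-k pages.
queries = ["Which page shows the quarterly revenue table?"]  # illustrative question
query_emb = F.normalize(
    gme.get_text_embeddings(texts=queries, instruction="Find a document page that answers the question."),
    dim=-1,
)
scores = query_emb @ page_emb.T        # (num_queries, num_pages)
topk = torch.topk(scores, k=2, dim=-1)
print(topk.values)    # similarity scores of the best pages
print(topk.indices)   # indices into page_images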

UMRB Evaluation

| Model | Size | T→T (16) | I→I (1) | T→I (4) | T→VD (10) | I→T (4) | T→IT (2) | IT→T (5) | IT→I (2) | IT→IT (3) | Avg. (47) |
|-------|------|----------|---------|---------|-----------|---------|----------|----------|----------|-----------|-----------|
| VISTA | 0.2B | 55.15 | 31.98 | 32.88 | 10.12 | 31.23 | 45.81 | 53.32 | 8.97 | 26.26 | 37.32 |
| CLIP-SF | 0.4B | 39.75 | 31.42 | 59.05 | 24.09 | 62.95 | 66.41 | 53.32 | 34.9 | 55.65 | 43.66 |
| One-Peace | 4B | 43.54 | 31.27 | 61.38 | 42.9 | 65.59 | 42.72 | 28.29 | 6.73 | 23.41 | 42.01 |
| DSE | 4.2B | 48.94 | 27.92 | 40.75 | 78.21 | 52.54 | 49.62 | 35.44 | 8.36 | 40.18 | 50.04 |
| E5-V | 8.4B | 52.41 | 27.36 | 46.56 | 41.22 | 47.95 | 54.13 | 32.9 | 23.17 | 7.23 | 42.52 |
| GME-Qwen2-VL-2B | 2.2B | 55.93 | 29.86 | 57.36 | 87.84 | 61.93 | 76.47 | 64.58 | 37.02 | 66.47 | 64.45 |
| GME-Qwen2-VL-7B | 8.3B | 58.19 | 31.89 | 61.35 | 89.92 | 65.83 | 80.94 | 66.18 | 42.56 | 73.62 | 67.44 |

Single-modal tasks: T→T, I→I; cross-modal tasks: T→I, T→VD, I→T; fused-modal tasks: T→IT, IT→T, IT→I, IT→IT. Numbers in parentheses are the number of subtasks in each group.

The English tab of the MTEB Leaderboard shows the text embedding performance of our models.

More detailed experimental results can be found in the paper.

Limitations

  • Single image input: For training efficiency, we limit the number of visual tokens per image to 1024 (a sketch of how this budget maps to image pixels follows below). Due to the lack of relevant data, our models and evaluations also cover only single-image inputs.
  • English-only training: Our models are trained only on English data. Although the underlying Qwen2-VL models are multilingual, multilingual multimodal embedding performance is not guaranteed.

We plan to extend to multi-image inputs, image-text interleaved data, and multilingual data in future releases.
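The 1024-token budget above follows from how Qwen2-VL tokenizes images: each visual token covers roughly a 28×28-pixel region, so the budget can be approximated by capping max_pixels on the image processor. Below is a hedged sketch using the Hugging Face AutoProcessor directly; the checkpoint path is illustrative, and whether the gme_inference wrapper exposes equivalent options is not covered here.

from transformers import AutoProcessor

# Each Qwen2-VL visual token corresponds to roughly a 28x28-pixel region,
# so ~1024 visual tokens per image can be approximated by capping max_pixels.
max_visual_tokens = 1024
processor = AutoProcessor.from_pretrained(
    "gme-Qwen2-VL-2B-Instruct",             # illustrative local path or model id
    min_pixels=256 * 28 * 28,
    max_pixels=max_visual_tokens * 28 * 28,
)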

Alibaba Cloud API Service

In addition to the open-source GME series, the GME models are also available as commercial API services on Alibaba Cloud.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

Citation

If you find our paper or models helpful, please consider citing:

@misc{zhang2024gme,
      title={GME: Improving Universal Multimodal Retrieval by Multimodal LLMs}, 
      author={Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min},
      year={2024},
      eprint={2412.16855},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={http://arxiv.org/abs/2412.16855}, 
}