vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations, so you can search and query visually rich multilingual documents without OCR, data-extraction pipelines, or chunking.
- **Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German:** together they form a new large, open-source, multilingual training dataset of 500k high-quality samples.
- **Cross-lingual Retrieval**: substantially better performance in real-world scenarios. For example, you can search German documents with Italian queries.
- **Matryoshka Representation Learning**: you can reduce the vector size 3x and still keep 98% of the embedding quality, as sketched below.
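With Matryoshka embeddings, shrinking a vector is just truncating it and re-normalizing. A minimal sketch (the 1536 → 512 truncation matches the 3x reduction above; the vector here is a placeholder, not real model output):

```python
import numpy as np

# Placeholder for a full-size embedding produced by the model (1536 dims).
embedding = np.random.randn(1536).astype(np.float32)

# Keep only the first 512 dimensions (3x smaller) and re-normalize
# so cosine similarity still behaves as expected.
small = embedding[:512]
small /= np.linalg.norm(small)
```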
# Usage
The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings with 768 image patches and a batch size of 16 even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.
| Batch Size | GPU Memory (GB) |
|------------|-----------------|
| 4 | 6.9 |
| 8 | 8.8 |
| 16 | 11.5 |
| 32 | 19.7 |
You can generate embeddings with this model in many different ways:
<details open>
<summary>
via LlamaIndex
</summary>
```bash
pip install -U llama-index-embeddings-huggingface
```
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
)

image_embedding = model.get_image_embedding("image.png")
query_embedding = model.get_query_embedding("some query")
```
</details>
<details open>
<summary>
via HuggingFace Transformers
</summary>

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch
import math
# more pixels -> better embeddings -> more VRAM -> slower inference
# In my experience, 768 image patches is the sweet spot for compute-efficient embeddings.
max_pixels = 768 * 28 * 28
min_pixels = 1 * 28 * 28
# Load the embedding model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    'llamaindex/vdr-2b-multi-v1',
    # These are the recommended kwargs for the model, but change them as needed
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
).eval()

processor = AutoProcessor.from_pretrained(
    'llamaindex/vdr-2b-multi-v1',
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
model.padding_side = "left"
processor.tokenizer.padding_side = "left"
document_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>What is shown in this image?<|im_end|>\n<|endoftext|>"
query_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Query: %s<|im_end|>\n<|endoftext|>"
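
# --- Hedged sketch (not the official inference code): one way to turn the
# prompts above into single-vector embeddings. DSE-style models pool the
# hidden state of the final token; exact pooling details may differ from
# the reference implementation.
def embed_image(image_path: str) -> torch.Tensor:
    image = Image.open(image_path)
    inputs = processor(text=[document_prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    emb = out.hidden_states[-1][:, -1]  # last-token hidden state
    return torch.nn.functional.normalize(emb, p=2, dim=-1)

def embed_query(query: str) -> torch.Tensor:
    # The query prompt also contains vision tokens, so a tiny placeholder
    # image is passed (an assumption borrowed from the DSE-Qwen2 recipe).
    placeholder = Image.new("RGB", (28, 28))
    inputs = processor(text=[query_prompt % query], images=[placeholder], return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    emb = out.hidden_states[-1][:, -1]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)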
```
</details>
<details open>
<summary>
via SentenceTransformers
</summary>

```python
from sentence_transformers import SentenceTransformer
import torch
model = SentenceTransformer(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="cuda",
    trust_remote_code=True,
    # These are the recommended kwargs for the model, but change them as needed if you don't have CUDA
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "device_map": "cuda:0",
        "attn_implementation": "flash_attention_2"
    },
)
embeddings = model.encode("image.png")
```
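At query time, you can embed text the same way and rank pages by cosine similarity. A hypothetical follow-up (the query string is illustrative, and the `Query:` prefix mirrors the prompt used in the Transformers snippet above):

```python
from sentence_transformers.util import cos_sim

query_embedding = model.encode("Query: what is shown in this image?")
score = cos_sim(query_embedding, embeddings)
```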
</details>
# Training
The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
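As a rough illustration of that setup, a generic InfoNCE-style contrastive loss with in-batch positives and hard-mined negatives might look like this (a sketch, not the actual training code; the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, d, neg, temperature=0.05):
    """q: (B, D) query embeddings, d: (B, D) positive document embeddings,
    neg: (B, D) hard-mined negative embeddings; all L2-normalized."""
    # Score each query against every in-batch document plus its hard negative.
    logits = torch.cat([q @ d.T, (q * neg).sum(-1, keepdim=True)], dim=1) / temperature
    # The matching document sits on the diagonal of the in-batch score matrix.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```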
# Results

The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only and mixed page screenshots. The evaluation dataset is publicly available [here on HuggingFace](https://huggingface.co/datasets/llamaindex/vdr-multilingual-test).
All evaluations are performed by calculating **NDCG@5** scores using **1536-dimension** vectors and an image resolution that can be represented with a **maximum of 768 tokens**.
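For reference, NDCG@5 discounts each relevant result by the log of its rank and normalizes by the score of an ideal ranking. A minimal sketch:

```python
import math

def ndcg_at_5(relevances):
    """relevances: relevance grades of the top-5 retrieved results, in rank order."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:5]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:5]))
    return dcg / idcg if idcg > 0 else 0.0
```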
| | Avg | Italian (text) | Italian (visual) | Italian (mix) |