first commit
parent 0104e4a128
commit 63cd7b5386
@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
274
README.md
@ -1,3 +1,273 @@
# nomic-embed-text-v2-moe
---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---

nomic-embed-text-v2-moe
# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview
`nomic-embed-text-v2-moe` is a state-of-the-art (SoTA) multilingual MoE text embedding model that excels at multilingual retrieval:

- **High Performance**: SoTA multilingual performance among ~300M-parameter models, and competitive with models twice its size
- **Multilinguality**: Supports ~100 languages and was trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), allowing a 3x reduction in storage cost with minimal performance degradation
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|----------|------|---------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| | | | | | | | |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |

## Model Architecture
- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing
- **Embedding Dimensions**: Supports flexible dimensions from 768 down to 256 through Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see Model Overview)
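
To make "8 experts with top-2 routing" concrete, here is a minimal pure-Python sketch of how a router might pick two experts for one token and combine their outputs. It is an illustration only, not the model's implementation: the function names, the toy experts, and the choice to leave the two selected weights unnormalized are all assumptions (the last one is our reading of `moe_normalize_expert_weights: false` in `config.json`).

```python
import math

NUM_EXPERTS = 8  # "8 experts" from the list above
TOP_K = 2        # "top-2 routing" from the list above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Assumption: weights are the raw softmax probabilities (not renormalized).
    return [(i, probs[i]) for i in ranked[:top_k]]


def moe_forward(token, router_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Hypothetical toy experts: expert i scales its input by (i + 1).
# Real experts are feed-forward networks.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_token(logits)
mixed = moe_forward([1.0, -1.0], logits, experts)
print(chosen)
print(mixed)
```

Only the two selected experts are evaluated per token, which is why the number of active parameters is smaller than the total.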

## Usage Guide

### Installation

The model can be used through SentenceTransformers and Transformers.

For best performance on GPU, please install:

```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```

> [!IMPORTANT]
> **Important!**
> The text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.

Please use `search_query: ` before your queries/questions, and `search_document: ` before your documents.
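
As a small illustration of the prefix rule, the helper below prepends the right prefix to a batch of texts. The helper name `add_task_prefix` and the `"query"`/`"document"` keys are hypothetical choices for this sketch; only the two prefix strings come from the text above.

```python
# Prefix strings from the instructions above; the dictionary keys are our own labels.
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
}


def add_task_prefix(texts, kind):
    """Prepend the task instruction prefix for `kind` to every text."""
    try:
        prefix = PREFIXES[kind]
    except KeyError:
        raise ValueError(f"kind must be one of {sorted(PREFIXES)}, got {kind!r}") from None
    # Skip texts that already carry the prefix, to avoid double-prefixing.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_task_prefix(["What is a mixture of experts?"], "query")
documents = add_task_prefix(["Hello!", "search_document: ¡Hola!"], "document")
print(queries)
print(documents)
```

The prefixed strings can then be passed to the tokenizer as-is.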

### Transformers

If using Transformers, **make sure to prepend the task instruction prefix**.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Documents carry the `search_document: ` task instruction prefix.
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that dot products equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```

### SentenceTransformers

With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
# prompt_name="passage" prepends the `search_document: ` prefix automatically.
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)

similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```
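
For retrieval, queries would be encoded with `prompt_name="query"` and documents with `prompt_name="passage"`, then documents are ranked by cosine similarity to the query. The sketch below shows only the ranking step, in plain Python, on made-up 3-dimensional vectors standing in for real 768-dimensional embeddings. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_emb, doc_embs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_emb, d)) for i, d in enumerate(doc_embs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings; real ones would come from model.encode(...).
query_emb = [1.0, 0.0, 0.0]
doc_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query_emb, doc_embs)
print(ranking)
```

With real embeddings that are already L2-normalized, the cosine reduces to a plain dot product.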

## Performance

nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:


nomic-embed-text-v2-moe performance on BEIR at 768 dimensions and truncated to 256 dimensions:

## Best Practices
- Add appropriate prefixes to your text:
  - For queries: "search_query: "
  - For documents: "search_document: "
- Maximum input length is 512 tokens
- For optimal efficiency, consider using the 256-dimension embeddings if storage/compute is a concern
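
One common way to obtain a smaller Matryoshka embedding is to keep the leading components and re-normalize. The sketch below does that in plain Python on a synthetic vector. It is an assumption that this is the right recipe for this model: the README does not spell out the truncation steps, and some Matryoshka models apply an extra normalization before truncating, so check the technical report before relying on it. The function name is hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim=256):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if dim > len(embedding):
        raise ValueError("dim cannot exceed the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Synthetic 768-dimensional vector standing in for a real embedding.
full = [math.sin(i) for i in range(768)]
small = truncate_and_normalize(full, dim=256)

print(len(full), "->", len(small))
print("storage ratio:", len(full) / len(small))
```

Going from 768 to 256 dimensions is where the 3x storage reduction comes from.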

## Limitations
- Performance may vary across different languages
- Resource requirements may be higher than traditional dense models due to the MoE architecture
- You must pass `trust_remote_code=True` when loading the model, since it uses our custom architecture implementation
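
A rough back-of-the-envelope calculation shows why the memory footprint follows the total parameter count rather than the active one. It uses values from `config.json` in this commit and assumes each expert is a plain two-matrix feed-forward block with biases ignored. That layout is an assumption, so treat the result as an estimate.

```python
# Values copied from config.json in this commit.
n_embd = 768
n_inner = 3072
n_layer = 12
moe_every_n_layers = 2
num_experts = 8
moe_top_k = 2

# Assumption: one expert = two weight matrices (n_embd -> n_inner -> n_embd), no biases.
params_per_expert = 2 * n_embd * n_inner
moe_layers = n_layer // moe_every_n_layers
inactive_experts = num_experts - moe_top_k
inactive_params = moe_layers * inactive_experts * params_per_expert

total_params = 475e6  # total parameter count stated in the README
estimated_active = total_params - inactive_params

print(f"inactive expert parameters: {inactive_params / 1e6:.1f}M")
print(f"estimated active parameters: {estimated_active / 1e6:.1f}M")
# All experts must be resident in memory even though only top-k run per token.
print(f"float32 weights in memory: {total_params * 4 / 1e9:.1f} GB")
```

Under these assumptions the estimate lands close to the stated 305M active parameters, while the weights held in memory still correspond to all 475M.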

## Training Details


- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning

For more details, please check out the [blog post](https://www.nomic.ai/blog/posts/nomic-embed-text-v2) and [technical report](https://www.arxiv.org/abs/2502.07972).

## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)

## Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{nussbaum2025trainingsparsemixtureexperts,
      title={Training Sparse Mixture Of Experts Text Embedding Models},
      author={Zach Nussbaum and Brandon Duderstadt},
      year={2025},
      eprint={2502.07972},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07972},
}
```
@ -0,0 +1,74 @@
{
  "_name_or_path": "nomic-ai/nomic-xlm-2048",
  "activation_function": "gelu",
  "add_pooling_layer": false,
  "architectures": [
    "NomicBertModel"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining",
    "AutoModelForMultipleChoice": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForMultipleChoice",
    "AutoModelForQuestionAnswering": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForQuestionAnswering",
    "AutoModelForSequenceClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForSequenceClassification",
    "AutoModelForTokenClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForTokenClassification"
  },
  "bos_token_id": null,
  "causal": false,
  "dense_seq_output": true,
  "embd_pdrop": 0.1,
  "eos_token_id": null,
  "expert_choice_router": false,
  "ffn_div": 1,
  "fused_bias_fc": true,
  "fused_dropout_add_ln": true,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_trained_positions": 2048,
  "mlp_fc1_bias": true,
  "mlp_fc2_bias": true,
  "model_type": "nomic_bert",
  "moe_every_n_layers": 2,
  "moe_impl": "megablocks",
  "moe_normalize_expert_weights": false,
  "moe_resid_pdrop": 0.0,
  "moe_top_k": 2,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": 3072,
  "n_layer": 12,
  "n_positions": 2048,
  "num_experts": 8,
  "num_shared_experts": 0,
  "pad_token_id": 1,
  "pad_vocab_size_multiple": 64,
  "parallel_block": false,
  "parallel_block_tied_norm": false,
  "prenorm": false,
  "qkv_proj_bias": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "rotary_emb_base": 10000,
  "rotary_emb_fraction": 1.0,
  "rotary_emb_interleaved": false,
  "rotary_emb_scale_base": null,
  "rotary_scaling_factor": null,
  "router_aux_loss_coef": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "use_flash_attn": true,
  "use_rms_norm": null,
  "use_xentropy": true,
  "vocab_size": 250048
}
@ -0,0 +1,20 @@
{
  "__version__": {
    "sentence_transformers": "3.3.0",
    "transformers": "4.44.2",
    "pytorch": "2.4.1+cu121"
  },
  "prompts": {
    "query": "search_query: ",
    "passage": "search_document: ",
    "Classification": "classification: ",
    "MultilabelClassification": "classification: ",
    "Clustering": "clustering: ",
    "PairClassification": "classification: ",
    "STS": "classification: ",
    "Summarization": "classification: ",
    "Speed": "search_document: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
Binary file not shown.
|
@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
Binary file not shown.
|
@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
Binary file not shown.
|
@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}