first commit

xxl 2025-03-03 15:37:55 +08:00
parent 0104e4a128
commit 63cd7b5386
11 changed files with 514 additions and 2 deletions

1_Pooling/config.json Normal file

@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}

README.md

@@ -1,3 +1,273 @@
---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---
# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings
## Model Overview
`nomic-embed-text-v2-moe` is a SoTA multilingual MoE text embedding model that excels at multilingual retrieval:
- **High Performance**: SoTA multilingual performance among ~300M-parameter models, competitive with models 2x its size
- **Multilinguality**: Supports ~100 languages and is trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147) for a 3x reduction in storage cost with minimal performance degradation (see the truncation sketch in the Usage Guide below)
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released
| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|----------|------|---------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |
## Model Architecture
- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing (sketched below)
- **Embedding Dimensions**: Flexible dimensions from 768 down to 256 via Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see the Performance section)
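As a concrete illustration of the routing described above, here is a minimal, self-contained sketch of top-2 token routing. It is illustrative only, not the model's actual implementation (which uses megablocks; see the [contrastors](https://github.com/nomic-ai/contrastors) repo), and `router_weight` is a hypothetical stand-in for the learned router parameters.
```python
import torch
import torch.nn.functional as F

def top2_route(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 2):
    # hidden: (num_tokens, d_model); router_weight: (d_model, num_experts)
    logits = hidden @ router_weight            # router scores per expert
    probs = F.softmax(logits, dim=-1)          # routing probabilities
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)
    # config.json sets moe_normalize_expert_weights to false, so the top-k
    # weights would be used as-is rather than re-normalized to sum to 1.
    return weights, expert_ids

hidden = torch.randn(4, 768)         # 4 tokens with 768-dim hidden states
router_weight = torch.randn(768, 8)  # 8 experts (random stand-in weights)
weights, expert_ids = top2_route(hidden, router_weight)
print(expert_ids.shape)  # torch.Size([4, 2]) -- two experts chosen per token
```
Each token's output is then a weighted combination of its selected experts' outputs, which is why only ~305M of the 475M parameters are active for any given input.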
## Usage Guide
### Installation
The model can be used through SentenceTransformers and Transformers.
For best performance on GPU, please install:
```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```
> [!IMPORTANT]
> **Important!**
> The text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
> Please use `search_query: ` before your queries/questions, and `search_document: ` before your documents.
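For example, a minimal helper for applying these prefixes (a hypothetical convenience function, not part of the model's API):
```python
# Hypothetical helper, not part of the model's API: prepend the required
# task instruction prefix to each raw text.
def add_task_prefix(texts, task="search_document"):
    return [f"{task}: {text}" for text in texts]

queries = add_task_prefix(["What is a mixture of experts model?"], task="search_query")
documents = add_task_prefix(["MoE models route each token to a few expert FFNs."])
```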
### Transformers
If using Transformers, **make sure to prepend the task instruction prefix**.
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Remember the task instruction prefix on every input.
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```
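Because the model is trained with Matryoshka representation learning, these 768-dimension embeddings can be truncated for cheaper storage. A minimal sketch continuing the example above (truncate-then-renormalize is a common Matryoshka recipe, assumed here rather than taken from the model card):
```python
# Keep only the leading 256 Matryoshka dimensions, then re-normalize so
# downstream cosine similarities stay unit-scale. Assumes `embeddings`
# and `F` from the Transformers example above.
embeddings_256 = F.normalize(embeddings[:, :256], p=2, dim=1)
print(embeddings_256.shape)
# torch.Size([2, 256])
```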
### SentenceTransformers
With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```
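Recent versions of SentenceTransformers can also truncate the Matryoshka embeddings for you via the `truncate_dim` argument:
```python
from sentence_transformers import SentenceTransformer

# truncate_dim truncates the final embeddings to the requested size;
# cosine similarity is scale-invariant, so the truncated vectors can be
# compared directly.
model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v2-moe",
    trust_remote_code=True,
    truncate_dim=256,
)
embeddings = model.encode(["Hello!", "¡Hola!"], prompt_name="passage")
print(embeddings.shape)
# (2, 256)
```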
## Performance
nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/xadjrezEIM0Q1jbgmjqO7.png)
nomic-embed-text-v2-moe performance on BEIR at 768 dimensions and truncated to 256 dimensions:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8hmhWQ_TTmlrviZFIBSxo.png)
## Best Practices
- Add appropriate prefixes to your text:
- For queries: "search_query: "
- For documents: "search_document: "
- Maximum input length is 512 tokens
- For optimal efficiency, consider using the 256-dimension embeddings if storage/compute is a concern
## Limitations
- Performance may vary across different languages
- Resource requirements may be higher than for traditional dense models due to the MoE architecture
- Must use `trust_remote_code=True` when loading the model to use our custom architecture implementation
## Training Details
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/F0lyAtV8wXMBmxSbtIgL4.png)
- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning
For more details, please check out the [blog post](https://www.nomic.ai/blog/posts/nomic-embed-text-v2) and [technical report](https://www.arxiv.org/abs/2502.07972).
## Join the Nomic Community
- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
## Citation
If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{nussbaum2025trainingsparsemixtureexperts,
title={Training Sparse Mixture Of Experts Text Embedding Models},
author={Zach Nussbaum and Brandon Duderstadt},
year={2025},
eprint={2502.07972},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07972},
}
```

config.json Normal file

@@ -0,0 +1,74 @@
{
  "_name_or_path": "nomic-ai/nomic-xlm-2048",
  "activation_function": "gelu",
  "add_pooling_layer": false,
  "architectures": [
    "NomicBertModel"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining",
    "AutoModelForMultipleChoice": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForMultipleChoice",
    "AutoModelForQuestionAnswering": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForQuestionAnswering",
    "AutoModelForSequenceClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForSequenceClassification",
    "AutoModelForTokenClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForTokenClassification"
  },
  "bos_token_id": null,
  "causal": false,
  "dense_seq_output": true,
  "embd_pdrop": 0.1,
  "eos_token_id": null,
  "expert_choice_router": false,
  "ffn_div": 1,
  "fused_bias_fc": true,
  "fused_dropout_add_ln": true,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_trained_positions": 2048,
  "mlp_fc1_bias": true,
  "mlp_fc2_bias": true,
  "model_type": "nomic_bert",
  "moe_every_n_layers": 2,
  "moe_impl": "megablocks",
  "moe_normalize_expert_weights": false,
  "moe_resid_pdrop": 0.0,
  "moe_top_k": 2,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": 3072,
  "n_layer": 12,
  "n_positions": 2048,
  "num_experts": 8,
  "num_shared_experts": 0,
  "pad_token_id": 1,
  "pad_vocab_size_multiple": 64,
  "parallel_block": false,
  "parallel_block_tied_norm": false,
  "prenorm": false,
  "qkv_proj_bias": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "rotary_emb_base": 10000,
  "rotary_emb_fraction": 1.0,
  "rotary_emb_interleaved": false,
  "rotary_emb_scale_base": null,
  "rotary_scaling_factor": null,
  "router_aux_loss_coef": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "use_flash_attn": true,
  "use_rms_norm": null,
  "use_xentropy": true,
  "vocab_size": 250048
}

config_sentence_transformers.json Normal file

@@ -0,0 +1,20 @@
{
  "__version__": {
    "sentence_transformers": "3.3.0",
    "transformers": "4.44.2",
    "pytorch": "2.4.1+cu121"
  },
  "prompts": {
    "query": "search_query: ",
    "passage": "search_document: ",
    "Classification": "classification: ",
    "MultilabelClassification": "classification: ",
    "Clustering": "clustering: ",
    "PairClassification": "classification: ",
    "STS": "classification: ",
    "Summarization": "classification: ",
    "Speed": "search_document: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}

model.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

modules.json Normal file

@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]

sentence_bert_config.json Normal file

@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}

sentencepiece.bpe.model (Stored with Git LFS) Normal file

Binary file not shown.

special_tokens_map.json Normal file

@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.json (Stored with Git LFS) Normal file

Binary file not shown.

tokenizer_config.json Normal file

@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}