Charles95 2313a855af first commit 2024-08-21 06:59:50 +00:00


tags: sentence-transformers, feature-extraction, sentence-similarity, mteb, transformers, transformers.js
inference: false
license: apache-2.0
language: en, zh
model-index name: jina-embeddings-v2-base-zh
results:

Task: STS. Dataset: MTEB AFQMC (type C-MTEB/AFQMC, config default, split validation, revision None)
cos_sim_pearson: 48.51403119231363
cos_sim_spearman: 50.5928547846445
euclidean_pearson: 48.750436310559074
euclidean_spearman: 50.50950238691385
manhattan_pearson: 48.7866189440328
manhattan_spearman: 50.58692402017165

Task: STS. Dataset: MTEB ATEC (type C-MTEB/ATEC, config default, split test, revision None)
cos_sim_pearson: 50.25985700105725
cos_sim_spearman: 51.28815934593989
euclidean_pearson: 52.70329248799904
euclidean_spearman: 50.94101139559258
manhattan_pearson: 52.6647237400892
manhattan_spearman: 50.922441325406176

Task: Classification. Dataset: MTEB AmazonReviewsClassification (zh) (type mteb/amazon_reviews_multi, config zh, split test, revision 1399c76144fd37290681b995c656ef9b2e06e26d)
accuracy: 34.944
f1: 34.06478860660109

Task: STS. Dataset: MTEB BQ (type C-MTEB/BQ, config default, split test, revision None)
cos_sim_pearson: 65.15667035488342
cos_sim_spearman: 66.07110142081
euclidean_pearson: 60.447598102249714
euclidean_spearman: 61.826575796578766
manhattan_pearson: 60.39364279354984
manhattan_spearman: 61.78743491223281

Task: Clustering. Dataset: MTEB CLSClusteringP2P (type C-MTEB/CLSClusteringP2P, config default, split test, revision None)
v_measure: 39.96714175391701

Task: Clustering. Dataset: MTEB CLSClusteringS2S (type C-MTEB/CLSClusteringS2S, config default, split test, revision None)
v_measure: 38.39863566717934

Task: Reranking. Dataset: MTEB CMedQAv1 (type C-MTEB/CMedQAv1-reranking, config default, split test, revision None)
map: 83.63680381780644
mrr: 86.16476190476192

Task: Reranking. Dataset: MTEB CMedQAv2 (type C-MTEB/CMedQAv2-reranking, config default, split test, revision None)
map: 83.74350667859487
mrr: 86.10388888888889

Task: Retrieval. Dataset: MTEB CmedqaRetrieval (type C-MTEB/CmedqaRetrieval, config default, split dev, revision None)
map_at_1: 22.072
map_at_10: 32.942
map_at_100: 34.768
map_at_1000: 34.902
map_at_3: 29.357
map_at_5: 31.236000000000004
mrr_at_1: 34.259
mrr_at_10: 41.957
mrr_at_100: 42.982
mrr_at_1000: 43.042
mrr_at_3: 39.722
mrr_at_5: 40.898
ndcg_at_1: 34.259
ndcg_at_10: 39.153
ndcg_at_100: 46.493
ndcg_at_1000: 49.01
ndcg_at_3: 34.636
ndcg_at_5: 36.278
precision_at_1: 34.259
precision_at_10: 8.815000000000001
precision_at_100: 1.474
precision_at_1000: 0.179
precision_at_3: 19.73
precision_at_5: 14.174000000000001
recall_at_1: 22.072
recall_at_10: 48.484
recall_at_100: 79.035
recall_at_1000: 96.15
recall_at_3: 34.607
recall_at_5: 40.064

Task: PairClassification. Dataset: MTEB Cmnli (type C-MTEB/CMNLI, config default, split validation, revision None)
cos_sim_accuracy: 76.7047504509922
cos_sim_ap: 85.26649874800871
cos_sim_f1: 78.13528724646915
cos_sim_precision: 71.57587548638132
cos_sim_recall: 86.01823708206688
dot_accuracy: 70.13830426939266
dot_ap: 77.01510412382171
dot_f1: 73.56710042713817
dot_precision: 63.955094991364426
dot_recall: 86.57937806873977
euclidean_accuracy: 75.53818400481059
euclidean_ap: 84.34668448241264
euclidean_f1: 77.51741608613047
euclidean_precision: 70.65614777756399
euclidean_recall: 85.85457096095394
manhattan_accuracy: 75.49007817197835
manhattan_ap: 84.40297506704299
manhattan_f1: 77.63185324160932
manhattan_precision: 70.03949595636637
manhattan_recall: 87.07037643207856
max_accuracy: 76.7047504509922
max_ap: 85.26649874800871
max_f1: 78.13528724646915

Task: Retrieval. Dataset: MTEB CovidRetrieval (type C-MTEB/CovidRetrieval, config default, split dev, revision None)
map_at_1: 69.178
map_at_10: 77.523
map_at_100: 77.793
map_at_1000: 77.79899999999999
map_at_3: 75.878
map_at_5: 76.849
mrr_at_1: 69.44200000000001
mrr_at_10: 77.55
mrr_at_100: 77.819
mrr_at_1000: 77.826
mrr_at_3: 75.957
mrr_at_5: 76.916
ndcg_at_1: 69.44200000000001
ndcg_at_10: 81.217
ndcg_at_100: 82.45
ndcg_at_1000: 82.636
ndcg_at_3: 77.931
ndcg_at_5: 79.655
precision_at_1: 69.44200000000001
precision_at_10: 9.357
precision_at_100: 0.993
precision_at_1000: 0.101
precision_at_3: 28.1
precision_at_5: 17.724
recall_at_1: 69.178
recall_at_10: 92.624
recall_at_100: 98.209
recall_at_1000: 99.684
recall_at_3: 83.772
recall_at_5: 87.882

Task: Retrieval. Dataset: MTEB DuRetrieval (type C-MTEB/DuRetrieval, config default, split dev, revision None)
map_at_1: 25.163999999999998
map_at_10: 76.386
map_at_100: 79.339
map_at_1000: 79.39500000000001
map_at_3: 52.959
map_at_5: 66.59
mrr_at_1: 87.9
mrr_at_10: 91.682
mrr_at_100: 91.747
mrr_at_1000: 91.751
mrr_at_3: 91.267
mrr_at_5: 91.527
ndcg_at_1: 87.9
ndcg_at_10: 84.569
ndcg_at_100: 87.83800000000001
ndcg_at_1000: 88.322
ndcg_at_3: 83.473
ndcg_at_5: 82.178
precision_at_1: 87.9
precision_at_10: 40.605000000000004
precision_at_100: 4.752
precision_at_1000: 0.488
precision_at_3: 74.9
precision_at_5: 62.96000000000001
recall_at_1: 25.163999999999998
recall_at_10: 85.97399999999999
recall_at_100: 96.63000000000001
recall_at_1000: 99.016
recall_at_3: 55.611999999999995
recall_at_5: 71.936

Task: Retrieval. Dataset: MTEB EcomRetrieval (type C-MTEB/EcomRetrieval, config default, split dev, revision None)
map_at_1: 48.6
map_at_10: 58.831
map_at_100: 59.427
map_at_1000: 59.44199999999999
map_at_3: 56.383
map_at_5: 57.753
mrr_at_1: 48.6
mrr_at_10: 58.831
mrr_at_100: 59.427
mrr_at_1000: 59.44199999999999
mrr_at_3: 56.383
mrr_at_5: 57.753
ndcg_at_1: 48.6
ndcg_at_10: 63.951
ndcg_at_100: 66.72200000000001
ndcg_at_1000: 67.13900000000001
ndcg_at_3: 58.882
ndcg_at_5: 61.373
precision_at_1: 48.6
precision_at_10: 8.01
precision_at_100: 0.928
precision_at_1000: 0.096
precision_at_3: 22.033
precision_at_5: 14.44
recall_at_1: 48.6
recall_at_10: 80.10000000000001
recall_at_100: 92.80000000000001
recall_at_1000: 96.1
recall_at_3: 66.10000000000001
recall_at_5: 72.2

Task: Classification. Dataset: MTEB IFlyTek (type C-MTEB/IFlyTek-classification, config default, split validation, revision None)
accuracy: 47.36437091188918
f1: 36.60946954228577

Task: Classification. Dataset: MTEB JDReview (type C-MTEB/JDReview-classification, config default, split test, revision None)
accuracy: 79.5684803001876
ap: 42.671935929201524
f1: 73.31912729103752

Task: STS. Dataset: MTEB LCQMC (type C-MTEB/LCQMC, config default, split test, revision None)
cos_sim_pearson: 68.62670112113864
cos_sim_spearman: 75.74009123170768
euclidean_pearson: 73.93002595958237
euclidean_spearman: 75.35222935003587
manhattan_pearson: 73.89870445158144
manhattan_spearman: 75.31714936339398

Task: Reranking. Dataset: MTEB MMarcoReranking (type C-MTEB/Mmarco-reranking, config default, split dev, revision None)
map: 31.5372713650176
mrr: 30.163095238095238

Task: Retrieval. Dataset: MTEB MMarcoRetrieval (type C-MTEB/MMarcoRetrieval, config default, split dev, revision None)
map_at_1: 65.054
map_at_10: 74.156
map_at_100: 74.523
map_at_1000: 74.535
map_at_3: 72.269
map_at_5: 73.41
mrr_at_1: 67.24900000000001
mrr_at_10: 74.78399999999999
mrr_at_100: 75.107
mrr_at_1000: 75.117
mrr_at_3: 73.13499999999999
mrr_at_5: 74.13499999999999
ndcg_at_1: 67.24900000000001
ndcg_at_10: 77.96300000000001
ndcg_at_100: 79.584
ndcg_at_1000: 79.884
ndcg_at_3: 74.342
ndcg_at_5: 76.278
precision_at_1: 67.24900000000001
precision_at_10: 9.466
precision_at_100: 1.027
precision_at_1000: 0.105
precision_at_3: 27.955999999999996
precision_at_5: 17.817
recall_at_1: 65.054
recall_at_10: 89.113
recall_at_100: 96.369
recall_at_1000: 98.714
recall_at_3: 79.45400000000001
recall_at_5: 84.06

Task: Classification. Dataset: MTEB MassiveIntentClassification (zh-CN) (type mteb/amazon_massive_intent, config zh-CN, split test, revision 31efe3c427b0bae9c22cbb560b8f15491cc6bed7)
accuracy: 68.1977135171486
f1: 67.23114308718404

Task: Classification. Dataset: MTEB MassiveScenarioClassification (zh-CN) (type mteb/amazon_massive_scenario, config zh-CN, split test, revision 7d571f92784cd94a019292a1f45445077d0ef634)
accuracy: 71.92669804976462
f1: 72.90628475628779

Task: Retrieval. Dataset: MTEB MedicalRetrieval (type C-MTEB/MedicalRetrieval, config default, split dev, revision None)
map_at_1: 49.2
map_at_10: 54.539
map_at_100: 55.135
map_at_1000: 55.19199999999999
map_at_3: 53.383
map_at_5: 54.142999999999994
mrr_at_1: 49.2
mrr_at_10: 54.539
mrr_at_100: 55.135999999999996
mrr_at_1000: 55.19199999999999
mrr_at_3: 53.383
mrr_at_5: 54.142999999999994
ndcg_at_1: 49.2
ndcg_at_10: 57.123000000000005
ndcg_at_100: 60.21300000000001
ndcg_at_1000: 61.915
ndcg_at_3: 54.772
ndcg_at_5: 56.157999999999994
precision_at_1: 49.2
precision_at_10: 6.52
precision_at_100: 0.8009999999999999
precision_at_1000: 0.094
precision_at_3: 19.6
precision_at_5: 12.44
recall_at_1: 49.2
recall_at_10: 65.2
recall_at_100: 80.10000000000001
recall_at_1000: 93.89999999999999
recall_at_3: 58.8
recall_at_5: 62.2

Task: Classification. Dataset: MTEB MultilingualSentiment (type C-MTEB/MultilingualSentiment-classification, config default, split validation, revision None)
accuracy: 63.29333333333334
f1: 63.03293854259612

Task: PairClassification. Dataset: MTEB Ocnli (type C-MTEB/OCNLI, config default, split validation, revision None)
cos_sim_accuracy: 75.69030860855442
cos_sim_ap: 80.6157833772759
cos_sim_f1: 77.87524366471735
cos_sim_precision: 72.3076923076923
cos_sim_recall: 84.37170010559663
dot_accuracy: 67.78559826746074
dot_ap: 72.00871467527499
dot_f1: 72.58722247394654
dot_precision: 63.57142857142857
dot_recall: 84.58289334741288
euclidean_accuracy: 75.20303194369248
euclidean_ap: 80.98587256415605
euclidean_f1: 77.26396917148362
euclidean_precision: 71.03631532329496
euclidean_recall: 84.68848996832101
manhattan_accuracy: 75.20303194369248
manhattan_ap: 80.93460699513219
manhattan_f1: 77.124773960217
manhattan_precision: 67.43083003952569
manhattan_recall: 90.07391763463569
max_accuracy: 75.69030860855442
max_ap: 80.98587256415605
max_f1: 77.87524366471735

Task: Classification. Dataset: MTEB OnlineShopping (type C-MTEB/OnlineShopping-classification, config default, split test, revision None)
accuracy: 87.00000000000001
ap: 83.24372135949511
f1: 86.95554191530607

Task: STS. Dataset: MTEB PAWSX (type C-MTEB/PAWSX, config default, split test, revision None)
cos_sim_pearson: 37.57616811591219
cos_sim_spearman: 41.490259084930045
euclidean_pearson: 38.9155043692188
euclidean_spearman: 39.16056534305623
manhattan_pearson: 38.76569892264335
manhattan_spearman: 38.99891685590743

Task: STS. Dataset: MTEB QBQTC (type C-MTEB/QBQTC, config default, split test, revision None)
cos_sim_pearson: 35.44858610359665
cos_sim_spearman: 38.11128146262466
euclidean_pearson: 31.928644189822457
euclidean_spearman: 34.384936631696554
manhattan_pearson: 31.90586687414376
manhattan_spearman: 34.35770153777186

Task: STS. Dataset: MTEB STS22 (zh) (type mteb/sts22-crosslingual-sts, config zh, split test, revision 6d1ba47164174a496b7fa5d3569dae26a6813b80)
cos_sim_pearson: 66.54931957553592
cos_sim_spearman: 69.25068863016632
euclidean_pearson: 50.26525596106869
euclidean_spearman: 63.83352741910006
manhattan_pearson: 49.98798282198196
manhattan_spearman: 63.87649521907841

Task: STS. Dataset: MTEB STSB (type C-MTEB/STSB, config default, split test, revision None)
cos_sim_pearson: 82.52782476625825
cos_sim_spearman: 82.55618986168398
euclidean_pearson: 78.48190631687673
euclidean_spearman: 78.39479731354655
manhattan_pearson: 78.51176592165885
manhattan_spearman: 78.42363787303265

Task: Reranking. Dataset: MTEB T2Reranking (type C-MTEB/T2Reranking, config default, split dev, revision None)
map: 67.36693873615643
mrr: 77.83847701797939

Task: Retrieval. Dataset: MTEB T2Retrieval (type C-MTEB/T2Retrieval, config default, split dev, revision None)
map_at_1: 25.795
map_at_10: 72.258
map_at_100: 76.049
map_at_1000: 76.134
map_at_3: 50.697
map_at_5: 62.324999999999996
mrr_at_1: 86.634
mrr_at_10: 89.792
mrr_at_100: 89.91900000000001
mrr_at_1000: 89.923
mrr_at_3: 89.224
mrr_at_5: 89.608
ndcg_at_1: 86.634
ndcg_at_10: 80.589
ndcg_at_100: 84.812
ndcg_at_1000: 85.662
ndcg_at_3: 82.169
ndcg_at_5: 80.619
precision_at_1: 86.634
precision_at_10: 40.389
precision_at_100: 4.93
precision_at_1000: 0.513
precision_at_3: 72.104
precision_at_5: 60.425
recall_at_1: 25.795
recall_at_10: 79.565
recall_at_100: 93.24799999999999
recall_at_1000: 97.595
recall_at_3: 52.583999999999996
recall_at_5: 66.175

Task: Classification. Dataset: MTEB TNews (type C-MTEB/TNews-classification, config default, split validation, revision None)
accuracy: 47.648999999999994
f1: 46.28925837008413

Task: Clustering. Dataset: MTEB ThuNewsClusteringP2P (type C-MTEB/ThuNewsClusteringP2P, config default, split test, revision None)
v_measure: 54.07641891287953

Task: Clustering. Dataset: MTEB ThuNewsClusteringS2S (type C-MTEB/ThuNewsClusteringS2S, config default, split test, revision None)
v_measure: 53.423702062353954

Task: Retrieval. Dataset: MTEB VideoRetrieval (type C-MTEB/VideoRetrieval, config default, split dev, revision None)
map_at_1: 55.7
map_at_10: 65.923
map_at_100: 66.42
map_at_1000: 66.431
map_at_3: 63.9
map_at_5: 65.225
mrr_at_1: 55.60000000000001
mrr_at_10: 65.873
mrr_at_100: 66.36999999999999
mrr_at_1000: 66.381
mrr_at_3: 63.849999999999994
mrr_at_5: 65.17500000000001
ndcg_at_1: 55.7
ndcg_at_10: 70.621
ndcg_at_100: 72.944
ndcg_at_1000: 73.25399999999999
ndcg_at_3: 66.547
ndcg_at_5: 68.93599999999999
precision_at_1: 55.7
precision_at_10: 8.52
precision_at_100: 0.958
precision_at_1000: 0.098
precision_at_3: 24.733
precision_at_5: 16
recall_at_1: 55.7
recall_at_10: 85.2
recall_at_100: 95.8
recall_at_1000: 98.3
recall_at_3: 74.2
recall_at_5: 80

Task: Classification. Dataset: MTEB Waimai (type C-MTEB/waimai-classification, config default, split test, revision None)
accuracy: 84.54
ap: 66.13603199670062
f1: 82.61420654584116




The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-embeddings-v2-base-zh is through Jina AI's Embedding API.

Intended Usage & Model Info

jina-embeddings-v2-base-zh is a Chinese/English bilingual text embedding model that supports a sequence length of up to 8192 tokens. It is based on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi to allow longer sequences. We designed it for high performance in monolingual and cross-lingual applications and trained it specifically to support mixed Chinese-English input without bias. Additionally, we provide the following embedding models:

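jina-embeddings-v2-base-zh is a text embedding model for both Chinese and English that can encode text of up to 8192 tokens. The model is built on the BERT architecture (JinaBERT). JinaBERT improves on BERT and is the first to apply ALiBi to an encoder architecture in order to support longer sequences. Unlike earlier monolingual or multilingual embedding models, we designed a bilingual model to better support both monolingual (Chinese query, Chinese documents) and cross-lingual (Chinese query, English documents) retrieval. In addition, we provide other embedding models: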

Data & Parameters

The data and training details are described in this technical report.

Usage

Please apply mean pooling when integrating the model.

Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence or paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an encode function that handles this for you.

However, if you would like to do it without using the default encode function:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

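def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask over the hidden dimension so padding tokens contribute nothing.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Masked sum divided by the number of real tokens (clamped to avoid division by zero).
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)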

sentences = ['How is the weather today?', '今天天气怎么样?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
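
To see what masked mean pooling computes without downloading the model, here is a toy pure-Python sketch of the same idea for a single sequence. The function name masked_mean_pool and the numbers are illustrative assumptions, not part of the model's API.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            for i, x in enumerate(vector):
                summed[i] += x
    # Clamp the token count, mirroring torch.clamp(..., min=1e-9) in the snippet above.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Toy sequence: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored, so the result is the average of the first two vectors only.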

You can use Jina Embedding models directly from the transformers package:

!pip install transformers
import torch
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
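
The cos_sim lambda above divides the dot product of two vectors by the product of their norms. A minimal standard-library sketch of the same formula, using made-up vectors rather than real embeddings, might look like this (cosine_similarity is a hypothetical helper name):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, scaling either vector leaves it unchanged.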

If you only want to handle shorter sequences, such as 2k tokens, pass the max_length parameter to the encode function:

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)

If you want to use the model together with the sentence-transformers package, make sure that you have installed the latest release and set trust_remote_code=True as well:

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))

As of its latest release (v2.3.0), sentence-transformers also supports Jina embeddings (please make sure that you are logged in to Hugging Face as well):

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-zh", # switch to en/zh for English or Chinese
    trust_remote_code=True
)

# control your input sequence length up to 8192
model.max_seq_length = 1024

embeddings = model.encode([
    'How is the weather today?',
    '今天天气怎么样?'
])
print(cos_sim(embeddings[0], embeddings[1]))

Alternatives to Using the Transformers Package

  1. Managed SaaS: Get started with a free key on Jina AI's Embedding API.
  2. Private and high-performance deployment: Get started by picking from our suite of models and deploying them on AWS SageMaker.

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
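
For readers unfamiliar with the two metrics named in the quote, the following sketch shows how hit rate and MRR are commonly computed when each query has one relevant document. The document ids and function names are hypothetical, and this is not the evaluation code used in the blog post.

```python
def hit_rate(ranked_results, relevant):
    """Fraction of queries whose relevant document appears in the retrieved list."""
    hits = sum(1 for docs, rel in zip(ranked_results, relevant) if rel in docs)
    return hits / len(ranked_results)


def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for docs, rel in zip(ranked_results, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(ranked_results)


# Three queries with their top-3 retrieved ids and one gold id each (made-up data).
ranked = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
gold = ["d1", "d2", "d0"]
print(hit_rate(ranked, gold))              # 2 of 3 queries hit
print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/2 + 0) / 3 = 0.5
```

A reranker can raise MRR without changing hit rate, since it reorders the retrieved list rather than adding documents to it.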

Troubleshooting

Loading of Model Code failed

If you forget to pass the trust_remote_code=True flag when calling AutoModel.from_pretrained or when initializing the model via the SentenceTransformer class, you will see a message that some model weights were not used during initialization. This happens because transformers falls back to creating a default BERT model instead of a jina-embeddings model:

Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...

User is not logged in to Hugging Face

The model is only available under gated access. This means you need to be logged in to Hugging Face to load it. If you receive the following error, you need to provide an access token, either by using the huggingface-cli or by providing the token via an environment variable:

OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
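
One possible way to pass a token from the environment is sketched below. It assumes the token is stored in a variable named HF_TOKEN and that your transformers release accepts the token keyword argument (older releases use use_auth_token, as named in the error message). The helper auth_kwargs is hypothetical.

```python
import os


def auth_kwargs(env=None):
    """Build keyword arguments for from_pretrained, adding a token when one is set."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")  # assumed variable name; adjust to your setup
    kwargs = {"trust_remote_code": True}
    if token:
        kwargs["token"] = token
    return kwargs


# Usage (needs the transformers package and network access, so it is left commented out):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-zh", **auth_kwargs())
print(sorted(auth_kwargs({"HF_TOKEN": "example-token"})))
```

Keeping the token in the environment avoids hard-coding it in scripts or notebooks.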

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@article{mohr2024multi,
  title={Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings},
  author={Mohr, Isabelle and Krimmel, Markus and Sturua, Saba and Akram, Mohammad Kalim and Koukounas, Andreas and G{\"u}nther, Michael and Mastrapas, Georgios and Ravishankar, Vinit and Mart{\'\i}nez, Joan Fontanals and Wang, Feng and others},
  journal={arXiv preprint arXiv:2402.17016},
  year={2024}
}