jina-embeddings-v2-base-zh

Charles95 2313a855af first commit		2024-08-21 06:59:50 +00:00
1_Pooling	first commit	2024-08-21 06:59:50 +00:00
onnx	first commit	2024-08-21 06:59:50 +00:00
.gitattributes	first commit	2024-08-21 06:59:50 +00:00
README.md	first commit	2024-08-21 06:59:50 +00:00
config.json	first commit	2024-08-21 06:59:50 +00:00
config_sentence_transformers.json	first commit	2024-08-21 06:59:50 +00:00
merges.txt	first commit	2024-08-21 06:59:50 +00:00
model.safetensors	first commit	2024-08-21 06:59:50 +00:00
modules.json	first commit	2024-08-21 06:59:50 +00:00
pytorch_model.bin	first commit	2024-08-21 06:59:50 +00:00
sentence_bert_config.json	first commit	2024-08-21 06:59:50 +00:00
special_tokens_map.json	first commit	2024-08-21 06:59:50 +00:00
tokenizer.json	first commit	2024-08-21 06:59:50 +00:00
tokenizer_config.json	first commit	2024-08-21 06:59:50 +00:00
vocab.json	first commit	2024-08-21 06:59:50 +00:00

README.md

key	value
tags	sentence-transformers, feature-extraction, sentence-similarity, mteb, transformers, transformers.js
inference	false
license	apache-2.0
language	
model-index	name: jina-embeddings-v2-base-zh; results listed below

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/AFQMC	MTEB AFQMC	default	validation	None

type	value
cos_sim_pearson	48.51403119231363

type	value
cos_sim_spearman	50.5928547846445

type	value
euclidean_pearson	48.750436310559074

type	value
euclidean_spearman	50.50950238691385

type	value
manhattan_pearson	48.7866189440328

type	value
manhattan_spearman	50.58692402017165

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/ATEC	MTEB ATEC	default	test	None

type	value
cos_sim_pearson	50.25985700105725

type	value
cos_sim_spearman	51.28815934593989

type	value
euclidean_pearson	52.70329248799904

type	value
euclidean_spearman	50.94101139559258

type	value
manhattan_pearson	52.6647237400892

type	value
manhattan_spearman	50.922441325406176

task

dataset

metrics

type
Classification

type	name	config	split	revision
mteb/amazon_reviews_multi	MTEB AmazonReviewsClassification (zh)	zh	test	1399c76144fd37290681b995c656ef9b2e06e26d

type	value
accuracy	34.944

type	value
f1	34.06478860660109

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/BQ	MTEB BQ	default	test	None

type	value
cos_sim_pearson	65.15667035488342

type	value
cos_sim_spearman	66.07110142081

type	value
euclidean_pearson	60.447598102249714

type	value
euclidean_spearman	61.826575796578766

type	value
manhattan_pearson	60.39364279354984

type	value
manhattan_spearman	61.78743491223281

task

dataset

metrics

type
Clustering

type	name	config	split	revision
C-MTEB/CLSClusteringP2P	MTEB CLSClusteringP2P	default	test	None

type	value
v_measure	39.96714175391701

task

dataset

metrics

type
Clustering

type	name	config	split	revision
C-MTEB/CLSClusteringS2S	MTEB CLSClusteringS2S	default	test	None

type	value
v_measure	38.39863566717934

task

dataset

metrics

type
Reranking

type	name	config	split	revision
C-MTEB/CMedQAv1-reranking	MTEB CMedQAv1	default	test	None

type	value
map	83.63680381780644

type	value
mrr	86.16476190476192

task

dataset

metrics

type
Reranking

type	name	config	split	revision
C-MTEB/CMedQAv2-reranking	MTEB CMedQAv2	default	test	None

type	value
map	83.74350667859487

type	value
mrr	86.10388888888889

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/CmedqaRetrieval	MTEB CmedqaRetrieval	default	dev	None

type	value
map_at_1	22.072

type	value
map_at_10	32.942

type	value
map_at_100	34.768

type	value
map_at_1000	34.902

type	value
map_at_3	29.357

type	value
map_at_5	31.236000000000004

type	value
mrr_at_1	34.259

type	value
mrr_at_10	41.957

type	value
mrr_at_100	42.982

type	value
mrr_at_1000	43.042

type	value
mrr_at_3	39.722

type	value
mrr_at_5	40.898

type	value
ndcg_at_1	34.259

type	value
ndcg_at_10	39.153

type	value
ndcg_at_100	46.493

type	value
ndcg_at_1000	49.01

type	value
ndcg_at_3	34.636

type	value
ndcg_at_5	36.278

type	value
precision_at_1	34.259

type	value
precision_at_10	8.815000000000001

type	value
precision_at_100	1.474

type	value
precision_at_1000	0.179

type	value
precision_at_3	19.73

type	value
precision_at_5	14.174000000000001

type	value
recall_at_1	22.072

type	value
recall_at_10	48.484

type	value
recall_at_100	79.035

type	value
recall_at_1000	96.15

type	value
recall_at_3	34.607

type	value
recall_at_5	40.064

task

dataset

metrics

type
PairClassification

type	name	config	split	revision
C-MTEB/CMNLI	MTEB Cmnli	default	validation	None

type	value
cos_sim_accuracy	76.7047504509922

type	value
cos_sim_ap	85.26649874800871

type	value
cos_sim_f1	78.13528724646915

type	value
cos_sim_precision	71.57587548638132

type	value
cos_sim_recall	86.01823708206688

type	value
dot_accuracy	70.13830426939266

type	value
dot_ap	77.01510412382171

type	value
dot_f1	73.56710042713817

type	value
dot_precision	63.955094991364426

type	value
dot_recall	86.57937806873977

type	value
euclidean_accuracy	75.53818400481059

type	value
euclidean_ap	84.34668448241264

type	value
euclidean_f1	77.51741608613047

type	value
euclidean_precision	70.65614777756399

type	value
euclidean_recall	85.85457096095394

type	value
manhattan_accuracy	75.49007817197835

type	value
manhattan_ap	84.40297506704299

type	value
manhattan_f1	77.63185324160932

type	value
manhattan_precision	70.03949595636637

type	value
manhattan_recall	87.07037643207856

type	value
max_accuracy	76.7047504509922

type	value
max_ap	85.26649874800871

type	value
max_f1	78.13528724646915

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/CovidRetrieval	MTEB CovidRetrieval	default	dev	None

type	value
map_at_1	69.178

type	value
map_at_10	77.523

type	value
map_at_100	77.793

type	value
map_at_1000	77.79899999999999

type	value
map_at_3	75.878

type	value
map_at_5	76.849

type	value
mrr_at_1	69.44200000000001

type	value
mrr_at_10	77.55

type	value
mrr_at_100	77.819

type	value
mrr_at_1000	77.826

type	value
mrr_at_3	75.957

type	value
mrr_at_5	76.916

type	value
ndcg_at_1	69.44200000000001

type	value
ndcg_at_10	81.217

type	value
ndcg_at_100	82.45

type	value
ndcg_at_1000	82.636

type	value
ndcg_at_3	77.931

type	value
ndcg_at_5	79.655

type	value
precision_at_1	69.44200000000001

type	value
precision_at_10	9.357

type	value
precision_at_100	0.993

type	value
precision_at_1000	0.101

type	value
precision_at_3	28.1

type	value
precision_at_5	17.724

type	value
recall_at_1	69.178

type	value
recall_at_10	92.624

type	value
recall_at_100	98.209

type	value
recall_at_1000	99.684

type	value
recall_at_3	83.772

type	value
recall_at_5	87.882

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/DuRetrieval	MTEB DuRetrieval	default	dev	None

type	value
map_at_1	25.163999999999998

type	value
map_at_10	76.386

type	value
map_at_100	79.339

type	value
map_at_1000	79.39500000000001

type	value
map_at_3	52.959

type	value
map_at_5	66.59

type	value
mrr_at_1	87.9

type	value
mrr_at_10	91.682

type	value
mrr_at_100	91.747

type	value
mrr_at_1000	91.751

type	value
mrr_at_3	91.267

type	value
mrr_at_5	91.527

type	value
ndcg_at_1	87.9

type	value
ndcg_at_10	84.569

type	value
ndcg_at_100	87.83800000000001

type	value
ndcg_at_1000	88.322

type	value
ndcg_at_3	83.473

type	value
ndcg_at_5	82.178

type	value
precision_at_1	87.9

type	value
precision_at_10	40.605000000000004

type	value
precision_at_100	4.752

type	value
precision_at_1000	0.488

type	value
precision_at_3	74.9

type	value
precision_at_5	62.96000000000001

type	value
recall_at_1	25.163999999999998

type	value
recall_at_10	85.97399999999999

type	value
recall_at_100	96.63000000000001

type	value
recall_at_1000	99.016

type	value
recall_at_3	55.611999999999995

type	value
recall_at_5	71.936

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/EcomRetrieval	MTEB EcomRetrieval	default	dev	None

type	value
map_at_1	48.6

type	value
map_at_10	58.831

type	value
map_at_100	59.427

type	value
map_at_1000	59.44199999999999

type	value
map_at_3	56.383

type	value
map_at_5	57.753

type	value
mrr_at_1	48.6

type	value
mrr_at_10	58.831

type	value
mrr_at_100	59.427

type	value
mrr_at_1000	59.44199999999999

type	value
mrr_at_3	56.383

type	value
mrr_at_5	57.753

type	value
ndcg_at_1	48.6

type	value
ndcg_at_10	63.951

type	value
ndcg_at_100	66.72200000000001

type	value
ndcg_at_1000	67.13900000000001

type	value
ndcg_at_3	58.882

type	value
ndcg_at_5	61.373

type	value
precision_at_1	48.6

type	value
precision_at_10	8.01

type	value
precision_at_100	0.928

type	value
precision_at_1000	0.096

type	value
precision_at_3	22.033

type	value
precision_at_5	14.44

type	value
recall_at_1	48.6

type	value
recall_at_10	80.10000000000001

type	value
recall_at_100	92.80000000000001

type	value
recall_at_1000	96.1

type	value
recall_at_3	66.10000000000001

type	value
recall_at_5	72.2

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/IFlyTek-classification	MTEB IFlyTek	default	validation	None

type	value
accuracy	47.36437091188918

type	value
f1	36.60946954228577

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/JDReview-classification	MTEB JDReview	default	test	None

type	value
accuracy	79.5684803001876

type	value
ap	42.671935929201524

type	value
f1	73.31912729103752

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/LCQMC	MTEB LCQMC	default	test	None

type	value
cos_sim_pearson	68.62670112113864

type	value
cos_sim_spearman	75.74009123170768

type	value
euclidean_pearson	73.93002595958237

type	value
euclidean_spearman	75.35222935003587

type	value
manhattan_pearson	73.89870445158144

type	value
manhattan_spearman	75.31714936339398

task

dataset

metrics

type
Reranking

type	name	config	split	revision
C-MTEB/Mmarco-reranking	MTEB MMarcoReranking	default	dev	None

type	value
map	31.5372713650176

type	value
mrr	30.163095238095238

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/MMarcoRetrieval	MTEB MMarcoRetrieval	default	dev	None

type	value
map_at_1	65.054

type	value
map_at_10	74.156

type	value
map_at_100	74.523

type	value
map_at_1000	74.535

type	value
map_at_3	72.269

type	value
map_at_5	73.41

type	value
mrr_at_1	67.24900000000001

type	value
mrr_at_10	74.78399999999999

type	value
mrr_at_100	75.107

type	value
mrr_at_1000	75.117

type	value
mrr_at_3	73.13499999999999

type	value
mrr_at_5	74.13499999999999

type	value
ndcg_at_1	67.24900000000001

type	value
ndcg_at_10	77.96300000000001

type	value
ndcg_at_100	79.584

type	value
ndcg_at_1000	79.884

type	value
ndcg_at_3	74.342

type	value
ndcg_at_5	76.278

type	value
precision_at_1	67.24900000000001

type	value
precision_at_10	9.466

type	value
precision_at_100	1.027

type	value
precision_at_1000	0.105

type	value
precision_at_3	27.955999999999996

type	value
precision_at_5	17.817

type	value
recall_at_1	65.054

type	value
recall_at_10	89.113

type	value
recall_at_100	96.369

type	value
recall_at_1000	98.714

type	value
recall_at_3	79.45400000000001

type	value
recall_at_5	84.06

task

dataset

metrics

type
Classification

type	name	config	split	revision
mteb/amazon_massive_intent	MTEB MassiveIntentClassification (zh-CN)	zh-CN	test	31efe3c427b0bae9c22cbb560b8f15491cc6bed7

type	value
accuracy	68.1977135171486

type	value
f1	67.23114308718404

task

dataset

metrics

type
Classification

type	name	config	split	revision
mteb/amazon_massive_scenario	MTEB MassiveScenarioClassification (zh-CN)	zh-CN	test	7d571f92784cd94a019292a1f45445077d0ef634

type	value
accuracy	71.92669804976462

type	value
f1	72.90628475628779

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/MedicalRetrieval	MTEB MedicalRetrieval	default	dev	None

type	value
map_at_1	49.2

type	value
map_at_10	54.539

type	value
map_at_100	55.135

type	value
map_at_1000	55.19199999999999

type	value
map_at_3	53.383

type	value
map_at_5	54.142999999999994

type	value
mrr_at_1	49.2

type	value
mrr_at_10	54.539

type	value
mrr_at_100	55.135999999999996

type	value
mrr_at_1000	55.19199999999999

type	value
mrr_at_3	53.383

type	value
mrr_at_5	54.142999999999994

type	value
ndcg_at_1	49.2

type	value
ndcg_at_10	57.123000000000005

type	value
ndcg_at_100	60.21300000000001

type	value
ndcg_at_1000	61.915

type	value
ndcg_at_3	54.772

type	value
ndcg_at_5	56.157999999999994

type	value
precision_at_1	49.2

type	value
precision_at_10	6.52

type	value
precision_at_100	0.8009999999999999

type	value
precision_at_1000	0.094

type	value
precision_at_3	19.6

type	value
precision_at_5	12.44

type	value
recall_at_1	49.2

type	value
recall_at_10	65.2

type	value
recall_at_100	80.10000000000001

type	value
recall_at_1000	93.89999999999999

type	value
recall_at_3	58.8

type	value
recall_at_5	62.2

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/MultilingualSentiment-classification	MTEB MultilingualSentiment	default	validation	None

type	value
accuracy	63.29333333333334

type	value
f1	63.03293854259612

task

dataset

metrics

type
PairClassification

type	name	config	split	revision
C-MTEB/OCNLI	MTEB Ocnli	default	validation	None

type	value
cos_sim_accuracy	75.69030860855442

type	value
cos_sim_ap	80.6157833772759

type	value
cos_sim_f1	77.87524366471735

type	value
cos_sim_precision	72.3076923076923

type	value
cos_sim_recall	84.37170010559663

type	value
dot_accuracy	67.78559826746074

type	value
dot_ap	72.00871467527499

type	value
dot_f1	72.58722247394654

type	value
dot_precision	63.57142857142857

type	value
dot_recall	84.58289334741288

type	value
euclidean_accuracy	75.20303194369248

type	value
euclidean_ap	80.98587256415605

type	value
euclidean_f1	77.26396917148362

type	value
euclidean_precision	71.03631532329496

type	value
euclidean_recall	84.68848996832101

type	value
manhattan_accuracy	75.20303194369248

type	value
manhattan_ap	80.93460699513219

type	value
manhattan_f1	77.124773960217

type	value
manhattan_precision	67.43083003952569

type	value
manhattan_recall	90.07391763463569

type	value
max_accuracy	75.69030860855442

type	value
max_ap	80.98587256415605

type	value
max_f1	77.87524366471735

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/OnlineShopping-classification	MTEB OnlineShopping	default	test	None

type	value
accuracy	87.00000000000001

type	value
ap	83.24372135949511

type	value
f1	86.95554191530607

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/PAWSX	MTEB PAWSX	default	test	None

type	value
cos_sim_pearson	37.57616811591219

type	value
cos_sim_spearman	41.490259084930045

type	value
euclidean_pearson	38.9155043692188

type	value
euclidean_spearman	39.16056534305623

type	value
manhattan_pearson	38.76569892264335

type	value
manhattan_spearman	38.99891685590743

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/QBQTC	MTEB QBQTC	default	test	None

type	value
cos_sim_pearson	35.44858610359665

type	value
cos_sim_spearman	38.11128146262466

type	value
euclidean_pearson	31.928644189822457

type	value
euclidean_spearman	34.384936631696554

type	value
manhattan_pearson	31.90586687414376

type	value
manhattan_spearman	34.35770153777186

task

dataset

metrics

type
STS

type	name	config	split	revision
mteb/sts22-crosslingual-sts	MTEB STS22 (zh)	zh	test	6d1ba47164174a496b7fa5d3569dae26a6813b80

type	value
cos_sim_pearson	66.54931957553592

type	value
cos_sim_spearman	69.25068863016632

type	value
euclidean_pearson	50.26525596106869

type	value
euclidean_spearman	63.83352741910006

type	value
manhattan_pearson	49.98798282198196

type	value
manhattan_spearman	63.87649521907841

task

dataset

metrics

type
STS

type	name	config	split	revision
C-MTEB/STSB	MTEB STSB	default	test	None

type	value
cos_sim_pearson	82.52782476625825

type	value
cos_sim_spearman	82.55618986168398

type	value
euclidean_pearson	78.48190631687673

type	value
euclidean_spearman	78.39479731354655

type	value
manhattan_pearson	78.51176592165885

type	value
manhattan_spearman	78.42363787303265

task

dataset

metrics

type
Reranking

type	name	config	split	revision
C-MTEB/T2Reranking	MTEB T2Reranking	default	dev	None

type	value
map	67.36693873615643

type	value
mrr	77.83847701797939

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/T2Retrieval	MTEB T2Retrieval	default	dev	None

type	value
map_at_1	25.795

type	value
map_at_10	72.258

type	value
map_at_100	76.049

type	value
map_at_1000	76.134

type	value
map_at_3	50.697

type	value
map_at_5	62.324999999999996

type	value
mrr_at_1	86.634

type	value
mrr_at_10	89.792

type	value
mrr_at_100	89.91900000000001

type	value
mrr_at_1000	89.923

type	value
mrr_at_3	89.224

type	value
mrr_at_5	89.608

type	value
ndcg_at_1	86.634

type	value
ndcg_at_10	80.589

type	value
ndcg_at_100	84.812

type	value
ndcg_at_1000	85.662

type	value
ndcg_at_3	82.169

type	value
ndcg_at_5	80.619

type	value
precision_at_1	86.634

type	value
precision_at_10	40.389

type	value
precision_at_100	4.93

type	value
precision_at_1000	0.513

type	value
precision_at_3	72.104

type	value
precision_at_5	60.425

type	value
recall_at_1	25.795

type	value
recall_at_10	79.565

type	value
recall_at_100	93.24799999999999

type	value
recall_at_1000	97.595

type	value
recall_at_3	52.583999999999996

type	value
recall_at_5	66.175

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/TNews-classification	MTEB TNews	default	validation	None

type	value
accuracy	47.648999999999994

type	value
f1	46.28925837008413

task

dataset

metrics

type
Clustering

type	name	config	split	revision
C-MTEB/ThuNewsClusteringP2P	MTEB ThuNewsClusteringP2P	default	test	None

type	value
v_measure	54.07641891287953

task

dataset

metrics

type
Clustering

type	name	config	split	revision
C-MTEB/ThuNewsClusteringS2S	MTEB ThuNewsClusteringS2S	default	test	None

type	value
v_measure	53.423702062353954

task

dataset

metrics

type
Retrieval

type	name	config	split	revision
C-MTEB/VideoRetrieval	MTEB VideoRetrieval	default	dev	None

type	value
map_at_1	55.7

type	value
map_at_10	65.923

type	value
map_at_100	66.42

type	value
map_at_1000	66.431

type	value
map_at_3	63.9

type	value
map_at_5	65.225

type	value
mrr_at_1	55.60000000000001

type	value
mrr_at_10	65.873

type	value
mrr_at_100	66.36999999999999

type	value
mrr_at_1000	66.381

type	value
mrr_at_3	63.849999999999994

type	value
mrr_at_5	65.17500000000001

type	value
ndcg_at_1	55.7

type	value
ndcg_at_10	70.621

type	value
ndcg_at_100	72.944

type	value
ndcg_at_1000	73.25399999999999

type	value
ndcg_at_3	66.547

type	value
ndcg_at_5	68.93599999999999

type	value
precision_at_1	55.7

type	value
precision_at_10	8.52

type	value
precision_at_100	0.958

type	value
precision_at_1000	0.098

type	value
precision_at_3	24.733

type	value
precision_at_5	16

type	value
recall_at_1	55.7

type	value
recall_at_10	85.2

type	value
recall_at_100	95.8

type	value
recall_at_1000	98.3

type	value
recall_at_3	74.2

type	value
recall_at_5	80

task

dataset

metrics

type
Classification

type	name	config	split	revision
C-MTEB/waimai-classification	MTEB Waimai	default	test	None

type	value
accuracy	84.54

type	value
ap	66.13603199670062

type	value
f1	82.61420654584116

The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-embeddings-v2-base-zh is to use Jina AI's Embedding API.

Intended Usage & Model Info

jina-embeddings-v2-base-zh is a Chinese/English bilingual text embedding model that supports a sequence length of up to 8192 tokens. It is based on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi to allow longer sequences; JinaBERT improves on BERT by applying ALiBi to an encoder architecture for the first time. Unlike earlier monolingual or multilingual embedding models, we have designed this bilingual model for high performance in both monolingual retrieval (Chinese queries over Chinese documents) and cross-lingual retrieval (Chinese queries over English documents), and trained it specifically to support mixed Chinese-English input without bias. Additionally, we provide the following embedding models:

jina-embeddings-v2-small-en: 33 million parameters.
jina-embeddings-v2-base-en: 137 million parameters.
jina-embeddings-v2-base-zh: 161 million parameters Chinese-English Bilingual embeddings (you are here).
jina-embeddings-v2-base-de: 161 million parameters German-English Bilingual embeddings.
jina-embeddings-v2-base-es: Spanish-English Bilingual embeddings (soon).
jina-embeddings-v2-base-code: 161 million parameters code embeddings.
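
The symmetric bidirectional ALiBi mentioned above can be illustrated with a small sketch. It builds the bias that is added to the attention scores, which penalizes a query-key pair in proportion to the distance between the two positions, in both directions. The slope schedule below is the geometric sequence from the original ALiBi paper for a power-of-two head count; that this model uses exactly this schedule, and the head count of 8, are assumptions made for illustration, and the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Geometric per-head slopes as in the original ALiBi paper.

    Assumes num_heads is a power of two (illustrative, not read from the model config).
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (head + 1) for head in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias for one head: -slope * |i - j| for query position i and key position j."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                    # hypothetical head count
bias = symmetric_alibi_bias(4, slopes[0])   # 4x4 bias matrix for the first head

for row in bias:
    print(row)
```

Because the bias depends only on the distance between positions and not on learned position embeddings, the same rule can be applied to sequences longer than those seen in training, which is the property the longer context length relies on.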

Data & Parameters

The data and training details are described in this technical report.

Usage

Please apply mean pooling when integrating the model.

Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence or paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an encode function that handles this for you.

However, if you would like to do it without using the default encode function:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

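def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the attention mask over the embedding dimension so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked embeddings and divide by the number of real tokens (clamped to avoid division by zero).
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)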

sentences = ['How is the weather today?', '今天天气怎么样?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
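
To make the effect of the attention mask concrete, here is a dependency-free sketch of the same masked mean pooling and L2 normalization on a toy input. The vectors, the mask, and the function names are made up for illustration; real token embeddings come from the model.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padded positions (mask 0) are ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    count = max(count, 1e-9)  # same guard against division by zero as torch.clamp above
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Two real tokens followed by one padding position with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # the padding row does not influence the average
unit = l2_normalize(pooled)
print(pooled, unit)
```

Without the mask, the padding row would dominate the average, which is why padded batches must be pooled with the attention mask rather than with a plain mean.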

You can use Jina Embedding models directly from the transformers package.

!pip install transformers
import torch
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
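
The cos_sim helper above divides the dot product of two vectors by the product of their lengths. The following sketch spells that out in plain Python and uses it to rank a few candidates against a query. The two-dimensional vectors and the document names are invented for illustration and are not model outputs.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.6, 0.8]
docs = {
    "doc_a": [0.8, 0.6],    # similar direction
    "doc_b": [-0.6, -0.8],  # opposite direction
    "doc_c": [3.0, 4.0],    # same direction as the query, different length
}

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

Cosine similarity ignores vector length, so doc_c ranks first even though it is five times longer than the query. For embeddings that are already L2-normalized, the dot product alone gives the same score.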

If you only want to handle shorter sequences, such as 2k tokens, pass the max_length parameter to the encode function:

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
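
As a rough guide to why a lower max_length can be worthwhile, the sketch below compares the size of the attention score matrix at two sequence lengths. It assumes standard self-attention, where every token is scored against every other token, so the matrix has seq_len × seq_len entries per head and layer. Actual speed and memory depend on the implementation and hardware, and the function name is hypothetical.

```python
def attention_score_entries(seq_len):
    """Number of query-key scores per head and layer under standard self-attention."""
    return seq_len * seq_len


full = attention_score_entries(8192)   # maximum supported length
short = attention_score_entries(2048)  # the max_length used in the example above
ratio = full // short

print(f"8192 tokens: {full} scores, 2048 tokens: {short} scores, ratio: {ratio}x")
```

Cutting the length by a factor of 4 reduces the number of attention scores by a factor of 16, at the cost of ignoring any text beyond the limit.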

If you want to use the model together with the sentence-transformers package, make sure that you have installed the latest release and set trust_remote_code=True as well:

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))

With its v2.3.0 release, sentence-transformers also supports Jina embeddings (please make sure that you are logged in to Hugging Face as well):

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-zh", # switch to en/zh for English or Chinese
    trust_remote_code=True
)

# control your input sequence length up to 8192
model.max_seq_length = 1024

embeddings = model.encode([
    'How is the weather today?',
    '今天天气怎么样?'
])
print(cos_sim(embeddings[0], embeddings[1]))

Alternatives to Using Transformers Package

Managed SaaS: Get started with a free key on Jina AI's Embedding API.
Private and high-performance deployment: Get started by picking from our suite of models and deploy them on AWS Sagemaker.

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
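
For readers unfamiliar with the two metrics named in the quote, here is a minimal sketch of how hit rate and MRR (mean reciprocal rank) are commonly computed for retrieval results. It assumes exactly one relevant document per query. The query IDs, document IDs, and function names are made up, and the blog post's own evaluation setup may differ.

```python
def hit_rate(ranked_lists, relevant, k):
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(1 for qid, ranked in ranked_lists.items() if relevant[qid] in ranked[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1 / rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for qid, ranked in ranked_lists.items():
        if relevant[qid] in ranked:
            total += 1.0 / (ranked.index(relevant[qid]) + 1)
    return total / len(ranked_lists)


# Retrieved document IDs per query, best match first (toy data).
ranked_lists = {
    "q1": ["d3", "d1", "d7"],  # relevant document at rank 2
    "q2": ["d2", "d9", "d4"],  # relevant document at rank 1
    "q3": ["d5", "d6", "d8"],  # relevant document not retrieved
}
relevant = {"q1": "d1", "q2": "d2", "q3": "d0"}

print(hit_rate(ranked_lists, relevant, k=2))
print(mean_reciprocal_rank(ranked_lists, relevant))
```

Hit rate only asks whether the relevant document was retrieved at all within the top k, while MRR also rewards placing it higher in the list, which is where a reranker can help.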

Troubleshooting

Loading of Model Code failed

If you forget to pass the trust_remote_code=True flag when calling AutoModel.from_pretrained or initializing the model via the SentenceTransformer class, you will receive an error that the model weights could not be initialized. This is caused by transformers falling back to creating a default BERT model instead of a jina-embedding model:

Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...

User is not logged in to Hugging Face

The model is only available under gated access. This means you need to be logged in to Hugging Face to load it. If you receive the following error, you need to provide an access token, either by using the huggingface-cli or by providing the token via an environment variable:

OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
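
One way to supply the token from an environment variable is sketched below. It assumes the token is stored in HF_TOKEN and that your transformers release accepts the `token` keyword in from_pretrained; older releases use `use_auth_token`, as in the error message above. The helper name hf_auth_kwargs is hypothetical.

```python
import os


def hf_auth_kwargs(env_var="HF_TOKEN"):
    """Return {'token': ...} if the environment variable is set and non-empty, else {}."""
    token = os.environ.get(env_var, "").strip()
    return {"token": token} if token else {}


# Intended use (needs transformers and network access, so it is left commented out):
# from transformers import AutoModel
# model = AutoModel.from_pretrained(
#     "jinaai/jina-embeddings-v2-base-zh",
#     trust_remote_code=True,
#     **hf_auth_kwargs(),
# )

print(sorted(hf_auth_kwargs().keys()))
```

Returning an empty dict when no token is set keeps the call valid for users who have already logged in with huggingface-cli login.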

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@article{mohr2024multi,
  title={Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings},
  author={Mohr, Isabelle and Krimmel, Markus and Sturua, Saba and Akram, Mohammad Kalim and Koukounas, Andreas and G{\"u}nther, Michael and Mastrapas, Georgios and Ravishankar, Vinit and Mart{\'\i}nez, Joan Fontanals and Wang, Feng and others},
  journal={arXiv preprint arXiv:2402.17016},
  year={2024}
}
