first commit

xxl 2025-02-21 09:17:42 +08:00
parent cfab9d2bb8
commit 58f4fc6793
6 changed files with 252659 additions and 1 deletions

129
README.md

@ -1,3 +1,130 @@
---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---
# RuModernBERT-base
RuModernBERT-base is the Russian version of [ModernBERT](https://arxiv.org/abs/2412.13663), a modernized bidirectional encoder-only Transformer model.
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code data with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.
| Model | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task |
|------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 35M | 384 | 12 | 50368 | 8192 | Masked LM |
| deepvk/RuModernBERT-base [this] | 150M | 768 | 22 | 50368 | 8192 | Masked LM |
## Usage
Update `transformers` to a release with ModernBERT support (this checkpoint was saved with version 4.48.1), and install `flash-attn` if your GPU supports it.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Prepare model
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()
# Prepare input
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
# Make prediction
outputs = model(**inputs)
# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: лимона
```
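The snippet above requests `flash_attention_2`, which needs the `flash-attn` package and a CUDA GPU. The following is a minimal sketch of a fallback, assuming the `"sdpa"` attention backend is acceptable when FlashAttention is unavailable. The helper name `choose_attn_implementation` is hypothetical and not part of `transformers`.

```python
import importlib.util


def choose_attn_implementation() -> str:
    """Return "flash_attention_2" only if flash-attn and a CUDA GPU look usable.

    Falls back to "sdpa" otherwise (an assumption: any supported backend would do).
    """
    # flash-attn is an optional dependency; check for it without importing it.
    if importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    try:
        import torch
    except ImportError:
        return "sdpa"
    return "flash_attention_2" if torch.cuda.is_available() else "sdpa"


attn_implementation = choose_attn_implementation()
print("Attention backend:", attn_implementation)
```

The returned string can be passed as `attn_implementation=...` to `AutoModelForMaskedLM.from_pretrained`. When FlashAttention is selected, the model and the inputs also need to be moved to the GPU.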
## Training Details
This is the base version with 150 million parameters and the same configuration as in [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.
### Tokenizer
We trained a new tokenizer following the original configuration.
We maintained the size of the vocabulary and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English from FineWeb.
### Dataset
Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data for all stages.
For the second and third stages, we used cleaner data sources.
| Data Source | Stage 1 | Stage 2 | Stage 3 |
|----------------------:|:--------:|:-------:|:--------:|
| FineWeb (En+Ru) | ✅ | ❌ | ❌ |
| CulturaX-Ru-Edu (Ru) | ❌ | ✅ | ❌ |
| Wiki (En+Ru) | ✅ | ✅ | ✅ |
| ArXiv (En) | ✅ | ✅ | ✅ |
| Book (En+Ru) | ✅ | ✅ | ✅ |
| Code | ✅ | ✅ | ✅ |
| StackExchange (En+Ru) | ✅ | ✅ | ✅ |
| Social (Ru) | ✅ | ✅ | ✅ |
| **Total Tokens** | 1.7T | 250B | 50B |
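As a quick sanity check, the per-stage totals in the table add up to the roughly 2 trillion tokens quoted in the introduction. A small sketch using only the numbers from the table:

```python
# Token counts per stage, copied from the "Total Tokens" row of the table.
stage_tokens = {
    "stage_1": 1_700_000_000_000,  # massive pre-training, 1.7T
    "stage_2": 250_000_000_000,    # context extension, 250B
    "stage_3": 50_000_000_000,     # cooldown, 50B
}

total = sum(stage_tokens.values())
print(f"Total: {total / 1e12:.1f}T tokens")  # Total: 2.0T tokens

# Share of the overall budget spent in each stage.
for name, count in stage_tokens.items():
    print(f"{name}: {count / total:.1%}")
```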
### Context length
In the first stage, the model was trained with a context length of `1,024`.
In the second and third stages, it was extended to `8,192`.
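Documents longer than 8,192 tokens have to be truncated or split before they reach the model. The following is a minimal sketch of window splitting, assuming the input is a list of token IDs produced without special tokens (for example with `add_special_tokens=False`). The function name `chunk_token_ids` is hypothetical; the special token IDs are taken from `config.json`.

```python
from typing import Iterator, List

MAX_LEN = 8192  # max_position_embeddings in config.json
CLS_ID = 50281  # cls_token_id in config.json
SEP_ID = 50282  # sep_token_id in config.json


def chunk_token_ids(ids: List[int], max_len: int = MAX_LEN) -> Iterator[List[int]]:
    """Yield windows of at most max_len tokens, each wrapped in [CLS] ... [SEP]."""
    body = max_len - 2  # reserve room for the two special tokens
    for start in range(0, len(ids), body):
        yield [CLS_ID] + ids[start:start + body] + [SEP_ID]


# Example with dummy token IDs standing in for a long document.
chunks = list(chunk_token_ids(list(range(20_000))))
print(len(chunks), [len(c) for c in chunks])
```

Non-overlapping windows are the simplest choice; overlapping windows may work better for tasks that depend on context near the window boundaries.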
## Evaluation
To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian SuperGLUE (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search for optimal hyperparameters and report metrics from the **dev** split.
For a fair comparison, we compare RuModernBERT only with raw encoders that were not trained on retrieval or sentence embedding tasks.
### Russian SuperGLUE
<img src="./rsg.jpg">
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|-------------------------------------------------------------------------------:|:---------:|:------:|:-------:|:-----:|:-------:|:-------:|:-------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 0.433 | 0.56 | 0.625 | 0.590 | 0.943 | 0.569 | 0.726 | 0.635 |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base) | 0.450 | 0.61 | 0.722 | 0.704 | 0.948 | 0.578 | **0.760** | 0.682 |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) | 0.491 | 0.61 | 0.663 | 0.769 | 0.962 | 0.574 | 0.678 | 0.678 |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 0.555 | **0.64** | 0.746 | 0.593 | 0.930 | 0.574 | 0.743 | 0.683 |
| deepvk/RuModernBERT-base [this] | **0.556** | 0.61 | **0.857** | **0.818** | **0.977** | **0.583** | 0.758 | **0.737** |
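The `Score` column appears to be the unweighted mean of the seven task metrics; this is inferred from the numbers in the table rather than stated by the authors. A short check for the RuModernBERT-base row:

```python
# Per-task dev metrics for deepvk/RuModernBERT-base, copied from the table.
rsg = {
    "RCB": 0.556,
    "PARus": 0.61,
    "MuSeRC": 0.857,
    "TERRa": 0.818,
    "RUSSE": 0.977,
    "RWSD": 0.583,
    "DaNetQA": 0.758,
}

# Assumption: Score is the plain average over the seven tasks.
score = sum(rsg.values()) / len(rsg)
print(round(score, 3))  # 0.737
```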
### Encodechka
| Model | Model Size | STS-B | Paraphraser | XNLI | Sentiment | Toxicity | Inappropriateness | Intents | IntentsX | FactRu | RuDReC | Avg. S | Avg. S+W |
|------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:----------:|:---------:|
| [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny) | 11.9M | 0.66 | 0.53 | **0.40** | 0.71 | 0.89 | 0.68 | 0.70 | **0.58** | 0.24 | 0.34 | 0.645 | 0.575 |
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 81.5M | **0.70** | **0.57** | 0.38 | **0.77** | **0.98** | 0.79 | 0.77 | 0.36 | 0.36 | **0.44** | 0.665 | **0.612** |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base) | 124M | 0.68 | 0.54 | 0.38 | 0.76 | **0.98** | **0.80** | **0.78** | 0.29 | 0.29 | 0.40 | 0.653 | 0.591 |
| [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) | 150M | 0.50 | 0.29 | 0.36 | 0.64 | 0.79 | 0.62 | 0.59 | 0.10 | 0.22 | 0.20 | 0.486 | 0.431 |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) | 178M | 0.67 | 0.53 | 0.39 | **0.77** | **0.98** | 0.78 | 0.77 | 0.38 | 🥴 | 🥴 | 0.659 | 🥴 |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) | 180M | 0.63 | 0.50 | 0.38 | 0.73 | 0.94 | 0.74 | 0.74 | 0.31 | 🥴 | 🥴 | 0.621 | 🥴 |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 35M | 0.64 | 0.50 | 0.36 | 0.72 | 0.95 | 0.73 | 0.72 | 0.47 | 0.28 | 0.26 | 0.636 | 0.563 |
| deepvk/RuModernBERT-base [this] | 150M | 0.67 | 0.54 | 0.35 | 0.75 | 0.97 | 0.76 | 0.76 | **0.58** | **0.37** | 0.36 | **0.673** | 0.611 |
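The two average columns can be approximately reproduced from the table, assuming `Avg. S` is the mean over the eight sentence-level tasks and `Avg. S+W` is the mean over all ten tasks, including the two word-level ones (FactRu and RuDReC). Because the per-task values are rounded to two decimals, the recomputed averages can differ from the reported ones in the third decimal.

```python
# Per-task scores for deepvk/RuModernBERT-base, copied from the table.
sentence_tasks = {
    "STS-B": 0.67, "Paraphraser": 0.54, "XNLI": 0.35, "Sentiment": 0.75,
    "Toxicity": 0.97, "Inappropriateness": 0.76, "Intents": 0.76, "IntentsX": 0.58,
}
word_tasks = {"FactRu": 0.37, "RuDReC": 0.36}

# Assumption: both averages are unweighted means.
avg_s = sum(sentence_tasks.values()) / len(sentence_tasks)
all_scores = list(sentence_tasks.values()) + list(word_tasks.values())
avg_sw = sum(all_scores) / len(all_scores)

print(f"Avg. S   ~ {avg_s:.4f} (reported 0.673)")
print(f"Avg. S+W ~ {avg_sw:.4f} (reported 0.611)")
```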
## Citation
```
@misc{deepvk2025rumodernbert,
title={RuModernBERT: Modernized BERT for Russian},
author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
url={https://huggingface.co/deepvk/rumodernbert-base},
publisher={Hugging Face},
year={2025},
}
```

82
config.json Normal file

@ -0,0 +1,82 @@
{
"activation_function": "gelu",
"allow_embedding_resizing": true,
"architectures": [
"ModernBertForMaskedLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"attention_layer": "rope",
"attention_probs_dropout_prob": 0.0,
"attn_out_bias": false,
"attn_out_dropout_prob": 0.1,
"attn_qkv_bias": false,
"bert_layer": "prenorm",
"bos_token_id": 50281,
"classifier_activation": "gelu",
"classifier_bias": false,
"classifier_dropout": 0.0,
"classifier_pooling": "cls",
"cls_token_id": 50281,
"compile_model": true,
"decoder_bias": true,
"deterministic_flash_attn": false,
"embed_dropout_prob": 0.0,
"embed_norm": true,
"embedding_dropout": 0.0,
"embedding_layer": "sans_pos",
"eos_token_id": 50282,
"final_norm": true,
"global_attn_every_n_layers": 3,
"global_rope_theta": 160000.0,
"head_pred_act": "gelu",
"hidden_act": "gelu",
"hidden_activation": "gelu",
"hidden_size": 768,
"init_method": "full_megatron",
"initializer_cutoff_factor": 2.0,
"initializer_range": 0.02,
"intermediate_size": 1152,
"local_attention": 128,
"local_attn_rotary_emb_base": 10000.0,
"local_rope_theta": 10000.0,
"loss_function": "fa_cross_entropy",
"loss_kwargs": {
"reduction": "mean"
},
"masked_prediction": true,
"max_position_embeddings": 8192,
"mlp_bias": false,
"mlp_dropout": 0.0,
"mlp_dropout_prob": 0.0,
"mlp_in_bias": false,
"mlp_layer": "glu",
"mlp_out_bias": false,
"model_type": "modernbert",
"norm_bias": false,
"norm_eps": 1e-05,
"norm_kwargs": {
"bias": false,
"eps": 1e-05
},
"normalization": "layernorm",
"num_attention_heads": 12,
"num_hidden_layers": 22,
"pad_token_id": 50283,
"padding": "unpadded",
"reference_compile": false,
"repad_logits_with_grad": false,
"rotary_emb_base": 160000.0,
"rotary_emb_dim": null,
"rotary_emb_interleaved": false,
"rotary_emb_scale_base": null,
"sep_token_id": 50282,
"skip_first_prenorm": true,
"sliding_window": 128,
"sparse_pred_ignore_index": -100,
"sparse_prediction": false,
"torch_dtype": "float32",
"transformers_version": "4.48.1",
"unpad_embeddings": true,
"vocab_size": 50368
}
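A few derived quantities follow from the values in `config.json`. The sketch below assumes the convention of the Hugging Face ModernBERT implementation, in which layer `i` uses global attention when `i % global_attn_every_n_layers == 0` and sliding-window (local) attention otherwise.

```python
# Values copied from config.json.
hidden_size = 768
num_attention_heads = 12
num_hidden_layers = 22
global_attn_every_n_layers = 3
local_attention = 128  # sliding window size in tokens

# Dimension of each attention head.
head_dim = hidden_size // num_attention_heads

# Assumption: layers whose index is a multiple of 3 attend globally.
global_layers = [
    i for i in range(num_hidden_layers) if i % global_attn_every_n_layers == 0
]
local_layers = [i for i in range(num_hidden_layers) if i not in global_layers]

print("head_dim:", head_dim)
print("global layers:", global_layers)
print("local layers:", len(local_layers), "with window", local_attention)
```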

BIN
rsg.jpg (Stored with Git LFS) Normal file

Binary file not shown.

46
special_tokens_map.json Normal file

@ -0,0 +1,46 @@
{
"additional_special_tokens": [
"<|padding|>",
"<|endoftext|>",
"[UNK]",
"[CLS]",
"[SEP]",
"[PAD]",
"[MASK]"
],
"cls_token": {
"content": "[CLS]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": {
"content": "[MASK]",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "[PAD]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"sep_token": {
"content": "[SEP]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "[UNK]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

251446
tokenizer.json Normal file

File diff suppressed because it is too large

954
tokenizer_config.json Normal file

@ -0,0 +1,954 @@
{
"added_tokens_decoder": {
"0": {
"content": "<|padding|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"3": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"4": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"5": {
"content": "|||EMAIL_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"6": {
"content": "|||PHONE_NUMBER|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50259": {
"content": "|||IP_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50260": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50261": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50262": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50263": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50264": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50265": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50266": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50267": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50268": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50269": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50270": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50271": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50272": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50273": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50274": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50275": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50276": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50277": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50278": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50279": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50280": {
"content": "[UNK]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50281": {
"content": "[CLS]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50282": {
"content": "[SEP]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50283": {
"content": "[PAD]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50284": {
"content": "[MASK]",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50285": {
"content": "[unused0]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50286": {
"content": "[unused1]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50287": {
"content": "[unused2]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50288": {
"content": "[unused3]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50289": {
"content": "[unused4]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50290": {
"content": "[unused5]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50291": {
"content": "[unused6]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50292": {
"content": "[unused7]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50293": {
"content": "[unused8]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50294": {
"content": "[unused9]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50295": {
"content": "[unused10]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50296": {
"content": "[unused11]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50297": {
"content": "[unused12]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50298": {
"content": "[unused13]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50299": {
"content": "[unused14]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50300": {
"content": "[unused15]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50301": {
"content": "[unused16]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50302": {
"content": "[unused17]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50303": {
"content": "[unused18]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50304": {
"content": "[unused19]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50305": {
"content": "[unused20]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50306": {
"content": "[unused21]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50307": {
"content": "[unused22]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50308": {
"content": "[unused23]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50309": {
"content": "[unused24]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50310": {
"content": "[unused25]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50311": {
"content": "[unused26]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50312": {
"content": "[unused27]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50313": {
"content": "[unused28]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50314": {
"content": "[unused29]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50315": {
"content": "[unused30]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50316": {
"content": "[unused31]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50317": {
"content": "[unused32]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50318": {
"content": "[unused33]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50319": {
"content": "[unused34]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50320": {
"content": "[unused35]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50321": {
"content": "[unused36]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50322": {
"content": "[unused37]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50323": {
"content": "[unused38]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50324": {
"content": "[unused39]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50325": {
"content": "[unused40]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50326": {
"content": "[unused41]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50327": {
"content": "[unused42]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50328": {
"content": "[unused43]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50329": {
"content": "[unused44]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50330": {
"content": "[unused45]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50331": {
"content": "[unused46]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50332": {
"content": "[unused47]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50333": {
"content": "[unused48]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50334": {
"content": "[unused49]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50335": {
"content": "[unused50]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50336": {
"content": "[unused51]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50337": {
"content": "[unused52]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50338": {
"content": "[unused53]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50339": {
"content": "[unused54]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50340": {
"content": "[unused55]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50341": {
"content": "[unused56]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50342": {
"content": "[unused57]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50343": {
"content": "[unused58]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50344": {
"content": "[unused59]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50345": {
"content": "[unused60]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50346": {
"content": "[unused61]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50347": {
"content": "[unused62]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50348": {
"content": "[unused63]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50349": {
"content": "[unused64]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50350": {
"content": "[unused65]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50351": {
"content": "[unused66]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50352": {
"content": "[unused67]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50353": {
"content": "[unused68]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50354": {
"content": "[unused69]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50355": {
"content": "[unused70]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50356": {
"content": "[unused71]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50357": {
"content": "[unused72]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50358": {
"content": "[unused73]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50359": {
"content": "[unused74]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50360": {
"content": "[unused75]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50361": {
"content": "[unused76]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50362": {
"content": "[unused77]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50363": {
"content": "[unused78]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50364": {
"content": "[unused79]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50365": {
"content": "[unused80]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50366": {
"content": "[unused81]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50367": {
"content": "[unused82]",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|padding|>",
"<|endoftext|>",
"[UNK]",
"[CLS]",
"[SEP]",
"[PAD]",
"[MASK]"
],
"clean_up_tokenization_spaces": true,
"cls_token": "[CLS]",
"extra_special_tokens": {},
"mask_token": "[MASK]",
"model_input_names": [
"input_ids",
"attention_mask"
],
"model_max_length": 1000000000000000019884624838656,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"tokenizer_class": "PreTrainedTokenizerFast",
"unk_token": "[UNK]"
}
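The `model_max_length` value above equals `int(1e30)`, which `transformers` appears to use as a placeholder when no limit is configured. If so, the tokenizer will not truncate to the model's 8,192-token context on its own. A small sketch of the check, with variable names chosen for illustration:

```python
# Value copied from tokenizer_config.json.
MODEL_MAX_LENGTH = 1000000000000000019884624838656
# Value copied from config.json (max_position_embeddings).
MAX_POSITION_EMBEDDINGS = 8192

# int(1e30) is not exactly 10**30 because 1e30 is a rounded float.
is_unset = MODEL_MAX_LENGTH == int(1e30)
print("model_max_length looks unset:", is_unset)  # True

# A safe limit to use when tokenizing is the smaller of the two values.
effective_limit = min(MODEL_MAX_LENGTH, MAX_POSITION_EMBEDDINGS)
print("effective limit:", effective_limit)  # 8192
```

In practice this means passing `truncation=True, max_length=8192` explicitly when calling the tokenizer on long inputs.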