first commit
This commit is contained in: parent 153e02c159, commit d703ded761

173  README.md

@@ -1,3 +1,172 @@
# SmolVLM_a13740444511891456137163

---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- google/siglip-so400m-patch14-384
---

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="Image description">

SmolVLM

# SmolVLM

SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded in multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

## Model Summary

- **Developed by:** Hugging Face 🤗
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

## Resources

- **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)

## Uses

SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks such as image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.

To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
<!-- todo: add link to fine-tuning tutorial -->

### Technical Summary

SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:

- **Image compression:** We introduce more aggressive image compression than in Idefics3, enabling faster inference and lower RAM usage.
- **Visual token encoding:** SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance.

More details about the training and architecture are available in our technical report.
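
As a rough illustration of the visual token budget described above, the sketch below estimates how many visual tokens an image consumes. It assumes an Idefics3-style split of the resized image into 384×384 sub-images plus one global view; the exact splitting is handled by the processor, so treat this helper as illustrative only.

```python
import math

def estimate_visual_tokens(width, height, longest_edge=1536,
                           patch_side=384, tokens_per_patch=81):
    # Resize so the longest edge is at most `longest_edge`, keeping the aspect ratio.
    scale = min(longest_edge / max(width, height), 1.0)
    w, h = math.ceil(width * scale), math.ceil(height * scale)
    # Assumed splitting: a grid of 384x384 sub-images plus one downscaled global view.
    n_patches = math.ceil(w / patch_side) * math.ceil(h / patch_side) + 1
    return n_patches * tokens_per_patch

print(estimate_visual_tokens(2048, 1365))  # a typical photo -> 13 patches, i.e. 1053 visual tokens
```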

### How to get started

You can use transformers to load SmolVLM, run inference, and fine-tune it.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
"""
User:<image>Can you describe the two images?
Assistant: I can describe the first one, but I can't describe the second one.
"""
```

### Model optimizations

**Precision**: For better performance, load and run the model in half precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.

```python
from transformers import AutoModelForVision2Seq
import torch

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16
).to("cuda")
```

You can also load SmolVLM with 4/8-bit quantization using bitsandbytes, torchao or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```
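
If 8-bit loading is still too large for your GPU, the same `BitsAndBytesConfig` also supports 4-bit loading. A minimal sketch, assuming bitsandbytes is installed; the NF4 and compute-dtype settings below are illustrative choices, not an official recipe:

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# Illustrative 4-bit settings (assumption: NF4 weights with bfloat16 compute).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```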

**Vision Encoder Efficiency**: Adjust the image resolution by setting `size={"longest_edge": N*384}` when initializing the processor, where N is your desired value. The default `N=4` works well and results in input images of size 1536×1536. For documents, `N=5` might be beneficial. Decreasing N can save GPU memory and is appropriate for lower-resolution images; it is also useful if you want to fine-tune on videos.
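
For example, a smaller `longest_edge` trades resolution for memory; a minimal sketch, assuming the keyword argument is forwarded to the image processor:

```python
from transformers import AutoProcessor

N = 3  # example value; lower than the default N=4 to reduce GPU memory for low-resolution inputs
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    size={"longest_edge": N * 384},
)
```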

## Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

### License

SmolVLM is built upon [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as the text decoder.

We release the SmolVLM checkpoints under the Apache 2.0 license.

## Training Details

### Training Data

The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), with emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities such as visual reasoning, chart comprehension, and general instruction following.

<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example Image" style="width:90%;" />

## Evaluation

| Model              | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|--------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
| SmolVLM            | 38.8       | 44.6                 | 42.1         | 81.6          | 72.7          | 5.02                      |
| Qwen-VL 2B         | 41.1       | 47.8                 | 47.5         | 90.1          | 79.7          | 13.70                     |
| InternVL2 2B       | 34.3       | 46.3                 | 49.8         | 86.9          | 73.4          | 10.52                     |
| PaliGemma 3B 448px | 34.9       | 28.7                 | 48.3         | 32.2          | 56.0          | 6.72                      |
| moondream2         | 32.4       | 24.3                 | 40.3         | 70.5          | 65.2          | 3.87                      |
| MiniCPM-V-2        | 38.2       | 39.8                 | 39.1         | 71.9          | 74.1          | 7.88                      |
| MM1.5 1B           | 35.8       | 37.2                 | 0.0          | 81.0          | 72.5          | NaN                       |

@@ -0,0 +1,5 @@
{
  "<end_of_utterance>": 49154,
  "<fake_token_around_image>": 49152,
  "<image>": 49153
}

@@ -0,0 +1,3 @@
{
  "chat_template": "<|im_start|>{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<|endoftext|>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
}

@@ -0,0 +1,252 @@
{
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "image_seq_len": 81,
  "image_token_id": 49153,
  "model_type": "idefics3",
  "scale_factor": 3,
  "text_config": {
    "_attn_implementation_autoset": false,
    "_flash_attn_2_enabled": true,
    "_name_or_path": "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model",
    "add_cross_attention": false,
    "architectures": [
      "VLlama3ForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 0,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 8192,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 16384,
    "min_length": 0,
    "mlp_bias": false,
    "model_type": "llama",
    "neftune_noise_alpha": 0.0,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 24,
    "num_key_value_heads": 32,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "perceiver_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_dropout": 0.0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "silu",
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "vllama3",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_key_value_heads": 1,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "qk_layer_norms_perceiver": false,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "resampler_depth": 6,
      "resampler_head_dim": 96,
      "resampler_n_heads": 16,
      "resampler_n_latents": 64,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "transformers_version": "4.46.0",
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "qk_layer_norms": false,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 273768.0,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "use_resampler": false,
    "vocab_size": 49155
  },
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.0",
  "use_cache": true,
  "vision_config": {
    "size": {"longest_edge": 1920},
    "max_image_size": {"longest_edge": 384},
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 384,
    "initializer_range": 0.02,
    "intermediate_size": 4304,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "idefics3",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 27,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false
  },
  "vocab_size": 49155
}

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}

@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 2,
  "transformers_version": "4.46.0"
}
File diff suppressed because it is too large
Binary file not shown.

@@ -0,0 +1,28 @@
{
  "do_convert_rgb": true,
  "do_image_splitting": true,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Idefics3ImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_image_size": {
    "longest_edge": 384
  },
  "processor_class": "Idefics3Processor",
  "resample": 1,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 1536
  }
}

@@ -0,0 +1,4 @@
{
  "image_seq_len": 81,
  "processor_class": "Idefics3Processor"
}

@@ -0,0 +1,53 @@
{
  "additional_special_tokens": [
    {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<end_of_utterance>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    }
  ],
  "bos_token": {
    "content": "<|im_start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
File diff suppressed because it is too large

@@ -0,0 +1,181 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49152": {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49153": {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49154": {
      "content": "<end_of_utterance>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<fake_token_around_image>",
    "<image>",
    "<end_of_utterance>"
  ],
  "bos_token": "<|im_start|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "legacy": false,
  "model_max_length": 16384,
  "pad_token": "<|im_end|>",
  "processor_class": "Idefics3Processor",
  "tokenizer_class": "GPT2Tokenizer",
  "truncation_side": "left",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
File diff suppressed because one or more lines are too long