first commit

parent 153e02c159
commit d703ded761

README.md | 173

@@ -1,3 +1,172 @@
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- google/siglip-so400m-patch14-384
---

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="Image description">

# SmolVLM

SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs.
Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images,
or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications
while maintaining strong performance on multimodal tasks.

## Model Summary

- **Developed by:** Hugging Face 🤗
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

## Resources

- **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)

## Uses

SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images.
Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on
visual content. The model does not support image generation.

To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
<!-- todo: add link to fine-tuning tutorial -->

### Technical Summary

SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience.
It introduces several changes compared to previous Idefics models:

- **Image compression:** We apply more aggressive image compression than Idefics3, so the model infers faster and uses less RAM.
- **Visual Token Encoding:** SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance.
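
For intuition, here is a rough back-of-the-envelope sketch of how image size translates into visual tokens. It assumes Idefics3-style preprocessing (resize so the longest side fits the processor's `longest_edge`, tile into 384×384 crops, and add one downscaled global view when an image is split); the exact resizing rules live in the Idefics3 image processor, so treat this only as an estimate.

```python
import math

def visual_token_count(width, height, longest_edge=1536, patch_edge=384, tokens_per_patch=81):
    """Rough estimate of SmolVLM visual tokens for one image (sketch, not the exact processor logic)."""
    # Downscale so the longest side fits within `longest_edge` (no upscaling in this sketch).
    scale = min(longest_edge / max(width, height), 1.0)
    w, h = math.ceil(width * scale), math.ceil(height * scale)
    # Tile the resized image into 384x384 crops.
    cols, rows = math.ceil(w / patch_edge), math.ceil(h / patch_edge)
    n_crops = cols * rows
    if n_crops > 1:
        n_crops += 1  # assumption: one global thumbnail is added whenever splitting happens
    return n_crops * tokens_per_patch

print(visual_token_count(2048, 1536))  # a large photo -> several crops
print(visual_token_count(384, 384))    # a single 384x384 patch -> 81 tokens
```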

More details about the training and architecture are available in our technical report.

### How to get started

You can use transformers to load, run inference with, and fine-tune SmolVLM.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
"""
User:<image>Can you describe the two images?
Assistant: I can describe the first one, but I can't describe the second one.
"""
```

### Model optimizations

**Precision**: For better performance, load and run the model in half-precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.

```python
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16
).to("cuda")
```

You can also load SmolVLM with 4/8-bit quantization using bitsandbytes, torchao or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```

**Vision Encoder Efficiency**: Adjust the image resolution by setting `size={"longest_edge": N*384}` when initializing the processor, where N is your desired value. The default `N=4` works well and results in input images of size 1536×1536. For documents, `N=5` might be beneficial. Decreasing N saves GPU memory and is appropriate for lower-resolution images; it is also useful if you want to fine-tune on videos.
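
For example, to trade resolution for memory you can pass a smaller `N` when creating the processor (a short sketch; `N` is just a local variable here):

```python
from transformers import AutoProcessor

# Default is N = 4 (1536 px on the longest edge); lower N saves GPU memory, N = 5 can help on documents.
N = 3
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    size={"longest_edge": N * 384},
)
```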

## Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

### License

SmolVLM is built upon [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as the text decoder.

We release the SmolVLM checkpoints under the Apache 2.0 license.

## Training Details

### Training Data

The training data comes from the [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix) datasets, with emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities like visual reasoning, chart comprehension, and general instruction following.

<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example Image" style="width:90%;" />

## Evaluation

| Model              | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|--------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
| SmolVLM            | 38.8       | 44.6                 | 42.1         | 81.6          | 72.7          | 5.02                      |
| Qwen-VL 2B         | 41.1       | 47.8                 | 47.5         | 90.1          | 79.7          | 13.70                     |
| InternVL2 2B       | 34.3       | 46.3                 | 49.8         | 86.9          | 73.4          | 10.52                     |
| PaliGemma 3B 448px | 34.9       | 28.7                 | 48.3         | 32.2          | 56.0          | 6.72                      |
| moondream2         | 32.4       | 24.3                 | 40.3         | 70.5          | 65.2          | 3.87                      |
| MiniCPM-V-2        | 38.2       | 39.8                 | 39.1         | 71.9          | 74.1          | 7.88                      |
| MM1.5 1B           | 35.8       | 37.2                 | 0.0          | 81.0          | 72.5          | NaN                       |

@@ -0,0 +1,5 @@
{
  "<end_of_utterance>": 49154,
  "<fake_token_around_image>": 49152,
  "<image>": 49153
}
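
The token-to-id mapping above (presumably `added_tokens.json`) can be cross-checked against the tokenizer shipped with the checkpoint; a small sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-Base")
for token in ["<fake_token_around_image>", "<image>", "<end_of_utterance>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
# Expected, per the mapping above: 49152, 49153, 49154
```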

@@ -0,0 +1,3 @@
{
  "chat_template": "<|im_start|>{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<|endoftext|>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
}
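
In practice this template is applied via `processor.apply_chat_template` (as in the getting-started example above); the stand-alone Jinja2 render below is only a sketch showing the prompt format it produces:

```python
from jinja2 import Template

# Same template string as in the JSON above, split across lines for readability.
chat_template = (
    "<|im_start|>{% for message in messages %}{{message['role'].capitalize()}}"
    "{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}"
    "{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}"
    "{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<|endoftext|>\n"
    "{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
)

messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Can you describe the image?"}]},
]

print(Template(chat_template).render(messages=messages, add_generation_prompt=True))
# <|im_start|>User:<image>Can you describe the image?<|endoftext|>
# Assistant:
```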

@@ -0,0 +1,252 @@
{
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "image_seq_len": 81,
  "image_token_id": 49153,
  "model_type": "idefics3",
  "scale_factor": 3,
  "text_config": {
    "_attn_implementation_autoset": false,
    "_flash_attn_2_enabled": true,
    "_name_or_path": "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model",
    "add_cross_attention": false,
    "architectures": [
      "VLlama3ForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 0,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 8192,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 16384,
    "min_length": 0,
    "mlp_bias": false,
    "model_type": "llama",
    "neftune_noise_alpha": 0.0,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 24,
    "num_key_value_heads": 32,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "perceiver_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_dropout": 0.0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "silu",
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "vllama3",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_key_value_heads": 1,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "qk_layer_norms_perceiver": false,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "resampler_depth": 6,
      "resampler_head_dim": 96,
      "resampler_n_heads": 16,
      "resampler_n_latents": 64,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "transformers_version": "4.46.0",
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "qk_layer_norms": false,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 273768.0,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "use_resampler": false,
    "vocab_size": 49155
  },
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.0",
  "use_cache": true,
  "vision_config": {
    "size": {"longest_edge": 1920},
    "max_image_size": {"longest_edge": 384},
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 384,
    "initializer_range": 0.02,
    "intermediate_size": 4304,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "idefics3",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 27,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false
  },
  "vocab_size": 49155
}
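
A quick way to inspect these values without reading the raw JSON is to load the config with transformers. This is only a sketch; the commented values mirror the file above, and it assumes extra keys such as `image_seq_len` are exposed as plain attributes on the loaded config object.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolVLM-Base")
print(config.model_type)                 # "idefics3"
print(config.image_seq_len)              # 81 visual tokens per 384x384 patch
print(config.scale_factor)               # 3 (pixel-shuffle compression factor)
print(config.text_config.hidden_size)    # 2048 (SmolLM2-1.7B backbone)
print(config.vision_config.patch_size)   # 14 (SigLIP-SO400M patch size)
```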

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}

@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 2,
  "transformers_version": "4.46.0"
}

File diff suppressed because it is too large
Binary file not shown

@@ -0,0 +1,28 @@
{
  "do_convert_rgb": true,
  "do_image_splitting": true,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Idefics3ImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_image_size": {
    "longest_edge": 384
  },
  "processor_class": "Idefics3Processor",
  "resample": 1,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 1536
  }
}
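
These flags mean raw pixel values are rescaled by 1/255 and then normalized with mean and std 0.5, so model inputs end up roughly in [-1, 1]. A tiny sketch of the arithmetic the Idefics3 image processor performs internally:

```python
import numpy as np

rescale_factor = 0.00392156862745098   # 1/255, as in the config above
mean = std = np.array([0.5, 0.5, 0.5])

pixels = np.array([0.0, 128.0, 255.0])          # toy values, one per channel
normalized = (pixels * rescale_factor - mean) / std
print(normalized)                               # ~[-1.0, 0.004, 1.0]
```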

@@ -0,0 +1,4 @@
{
  "image_seq_len": 81,
  "processor_class": "Idefics3Processor"
}

@@ -0,0 +1,53 @@
{
  "additional_special_tokens": [
    {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<end_of_utterance>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    }
  ],
  "bos_token": {
    "content": "<|im_start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

File diff suppressed because it is too large

@@ -0,0 +1,181 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49152": {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49153": {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49154": {
      "content": "<end_of_utterance>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<fake_token_around_image>",
    "<image>",
    "<end_of_utterance>"
  ],
  "bos_token": "<|im_start|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "legacy": false,
  "model_max_length": 16384,
  "pad_token": "<|im_end|>",
  "processor_class": "Idefics3Processor",
  "tokenizer_class": "GPT2Tokenizer",
  "truncation_side": "left",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}

File diff suppressed because one or more lines are too long