first commit

xxl 2025-02-25 09:26:49 +08:00
parent c73dc99a1c
commit 412cb3c21a
15 changed files with 2838 additions and 2 deletions

400
README.md

@@ -1,3 +1,399 @@
# Baichuan-M1-14B-Instruct
---
language:
- en
- zh
tags:
- medical
---
<div align="center">
<h1>
Baichuan-M1-14B-Instruct
</h1>
</div>
<p align="center">
🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base" target="_blank">Baichuan-M1-14B-Base</a> • 🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct" target="_blank">Baichuan-M1-14B-Instruct</a> • 📗 <a href="https://arxiv.org/abs/2502.12671" target="_blank">Technical Report</a> • 💬 <a href="https://y41.8if.cn/JQCj6n" target="_blank">WeChat</a>
</p>
---
# 📖 Table of Contents
- [🏁 Model Introduction](#intro)
- [🔬 Data Collection and Processing](#data)
- [🧠 New Model Architecture](#structure)
- [⚙️ Training Methodology](#training)
- [📊 Benchmark Results](#benchmark)
- [🚀 Quick Start](#quick)
- [📜 License and Statement](#declare)
- [🏷️ Reference](#reference)
---
<a name="intro"></a>
# 🏁 Model Introduction
**Baichuan-M1-14B** is the industry's first open-source large language model developed from scratch by Baichuan Intelligence specifically for medical scenarios. While retaining strong general capabilities, it delivers powerful performance in the medical field: it matches similarly sized models on most general benchmarks while outperforming models five times larger in medical scenarios. Its core features are:
- Trained from scratch on **20 trillion tokens** of high-quality medical and general data.
- Specialized modeling for **20+ medical departments** with fine-grained medical expertise.
- Introduces **innovative model architecture**, significantly improving context understanding and long-sequence task performance.
- Provides **[🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base)** and **[🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct)**.
---
<a name="data"></a>
# 🔬 Data Collection and Processing
## Medical Data Collection
We conducted meticulous data collection and synthesis for the medical field, including:
- **Tens of millions of professional medical data entries**: Chinese and English research papers, medical cases, medical textbooks, knowledge bases, etc.
- **Hundreds of millions of medical Q&A and clinical records**: covering complex medical reasoning and real-world clinical cases.
- **Comprehensive data classification and evaluation**: data is categorized by medical department, content, and value to ensure a balanced distribution and to filter for truly valuable medical data.
## Data Synthesis and Optimization
- **Synthetic data design**: Combining knowledge graphs, cases, and textbooks to generate diverse, high-quality medical reasoning data.
- **Self-reflection mechanism and reward model**: Continuously improving the quality of synthetic data, ultimately generating **nearly a trillion tokens** of reasoning data, covering long-tail knowledge and complex scenarios.
## General Data Collection
- **20T multilingual general dataset**: Including 14T English data, 4T Chinese data, and 2T data covering 30 mainstream languages.
- **Deduplication and upsampling strategy**: Upsampling high-quality data to significantly enhance model performance.
- **27 global knowledge categories**: Optimizing data ratios based on small model experiments to balance general and domain-specific capabilities.
---
<a name="structure"></a>
# 🧠 New Model Architecture
## Short Convolution Attention Mechanism
- Lightweight short convolutions are applied when computing the Key and Value states, which significantly reduces the standard Transformer's reliance on induction heads for in-context learning. Traditional Transformers depend on induction heads to capture repetitive patterns and contextual dependencies in sequences, which demands a certain model width and depth; applying a short convolution to the Key and Value sequences along the time dimension strengthens in-context learning directly. Extensive experiments, from toy models to models with over ten billion parameters, show that short convolution attention excels at language modeling, especially on tasks that depend heavily on contextual information. A minimal sketch of the idea follows.
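A minimal sketch, assuming each position is causally mixed with its immediate predecessor using a per-head kernel (the checkpoint stores `conv_k`/`conv_v` parameters per layer and `conv_window: 2` in `config.json`; the exact kernel shapes and combination used in `modeling_baichuan.py` may differ):

```python
import torch
import torch.nn.functional as F

def short_conv(x: torch.Tensor, conv_weight: torch.Tensor, conv_window: int = 2) -> torch.Tensor:
    """Causally mix each position with its previous positions along the time axis.

    x:           (batch, seq_len, num_heads, head_dim) -- Key or Value states
    conv_weight: (num_heads, conv_window)              -- hypothetical per-head mixing kernel
    """
    batch, seq_len, num_heads, head_dim = x.shape
    # Left-pad the time dimension so position t never sees tokens after t.
    x_pad = F.pad(x, (0, 0, 0, 0, conv_window - 1, 0))
    out = torch.zeros_like(x)
    for w in range(conv_window):
        # w = 0 picks the previous token, w = conv_window - 1 picks the current one.
        out = out + conv_weight[None, None, :, w, None] * x_pad[:, w:w + seq_len]
    return out

# Toy usage: mix the Key states with a window of 2 before attention.
k = torch.randn(1, 8, 4, 16)                    # (batch, seq, heads, head_dim)
w_k = torch.softmax(torch.randn(4, 2), dim=-1)  # hypothetical learned per-head weights
print(short_conv(k, w_k).shape)                 # torch.Size([1, 8, 4, 16])
```

Because the mixing window covers only two positions, the extra parameters and compute are negligible next to the attention and MLP blocks.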
## Sliding Window Attention Mechanism
- Some layers use a sliding window attention mechanism to reduce KV cache memory usage.
- **Optimization**: balances computational efficiency and performance, and is especially well suited to long-sequence tasks (see the sketch below).
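For intuition, a small illustrative sketch of the sliding-window causal mask (in this checkpoint, `config.json` sets `sliding_window: 8192` and applies it on the odd-indexed layers listed in `sliding_window_layers`; this is not the production attention kernel):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True iff query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # Causal (j <= i) and within the last `window` tokens (j > i - window).
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
```

Because keys outside the window are never attended to, those layers only need to keep the most recent `window` entries in their KV cache.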
## Optimizing Position Encoding Oscillation
- By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
- **Result**: More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
## High Peak Learning Rate Strategy
- Uses a **WSD (Warmup-Stable-Decay) learning rate schedule** with a high peak learning rate to promote model generalization (sketched below).
- **Comparison results**: significant improvement on benchmark task performance.
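A rough sketch of a WSD-style schedule: linear warmup to a high peak, a long stable stage at the peak, then a final decay. The stage lengths and peak value below are placeholders, not the values used to train Baichuan-M1:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule with illustrative stage fractions."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)          # warmup
    if step < stable_end:
        return peak_lr                                        # stable stage at the peak
    return peak_lr * max(0.0, (total_steps - step) / max(decay_steps, 1))  # final decay

print([round(wsd_lr(s, 1000, 3e-4), 6) for s in (0, 5, 500, 950, 1000)])
```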
## Adaptive Gradient Update
- **Dynamic gradient clipping**: updates are skipped when gradients are excessively large, reducing instability caused by unusual samples or steep regions of the loss landscape (see the sketch below).
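A minimal sketch of the idea in PyTorch: clip the gradient norm as usual, but skip the optimizer step outright when the raw norm is far above the threshold. The thresholds here are illustrative, not the actual training recipe:

```python
import torch

def clipped_or_skipped_step(optimizer: torch.optim.Optimizer,
                            parameters,
                            max_norm: float = 1.0,
                            skip_factor: float = 4.0) -> bool:
    """Clip gradients to `max_norm`; skip the update if the raw norm is extreme.

    Returns True if an optimizer step was taken.
    """
    params = [p for p in parameters if p.grad is not None]
    grad_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)  # returns pre-clip norm
    if grad_norm > skip_factor * max_norm:
        optimizer.zero_grad(set_to_none=True)   # drop this batch's gradients entirely
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True

# Toy usage with a linear layer.
layer = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
layer(torch.randn(2, 4)).sum().backward()
print("stepped:", clipped_or_skipped_step(opt, layer.parameters()))
```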
---
<a name="training"></a>
# ⚙️ Training Methodology
We innovatively adopted a **multi-stage curriculum learning and alignment optimization** approach, systematically enhancing model capabilities through the following two parts:
## 1. Multi-Stage Curriculum Learning
Training is divided into three stages, progressively optimizing the model's general and medical domain capabilities:
1. **General Knowledge Enhancement Stage**: Focused on general language modeling to build basic language ability and common-sense knowledge.
2. **Medical Basic Knowledge Enhancement Stage**: Introduces high-quality medical data to strengthen reasoning, mathematics, and core medical knowledge.
3. **Medical Advanced Knowledge Enhancement Stage**: Further improves data quality, focusing on complex medical reasoning, disease diagnosis, and long-tail knowledge.
## 2. Alignment Optimization
Enhancing model generation quality, logical reasoning, and user preference alignment through reinforcement learning and pairwise data optimization:
1. **Pairwise Data**: Covering multi-turn dialogues, instruction following, math and code, and reasoning tasks, sourced from human annotations and multi-model generation.
2. **Optimization Process**:
- **ELO**: Optimizing diverse, high-quality chain-of-thought generation based on maximum likelihood.
- **TDPO**: Using pairwise data to optimize the generation model for better user preference alignment.
- **PPO**: Further enhancing generation logic and task performance through policy optimization.
This combined approach of multi-stage and alignment optimization enables the model to achieve exceptional performance in both general and medical domain capabilities.
---
<a name="benchmark"></a>
# 📊 Benchmark Results
Our evaluation covers all mainstream benchmarks, achieving excellent metrics in both open-source and closed-source evaluations, demonstrating outstanding medical scenario capabilities while maintaining strong general performance.
<table style="border: 1px solid #000; border-collapse: collapse; width: 100%; text-align: center;">
<thead>
<tr>
<th>Category</th>
<th>Benchmark</th>
<th style="font-size:15px;">Baichuan-M1-14B-Instruct</th>
<th style="font-size:15px;">Qwen2.5-14B-Instruct</th>
<th style="font-size:15px;">Qwen2.5-72B-Instruct</th>
<th style="font-size:15px;">claude-3.5-sonnet-20241022</th>
<th style="font-size:15px;">gpt-4o</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2" style="text-align: center;">Average Score</td>
<td>72.23</td>
<td>65.39</td>
<td>70.51</td>
<td>74.85</td>
<td>75.00</td>
</tr>
<tr>
<td rowspan="7" style="vertical-align: middle;">Clinical Practice</td>
<td style="text-align: left;">cmbclin</td>
<td>77.40</td>
<td>71.51</td>
<td>75.36</td>
<td>78.37</td>
<td>75.36</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_diag</td>
<td>70.90</td>
<td>68.85</td>
<td>72.23</td>
<td>75.00</td>
<td>73.05</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_hos</td>
<td>70.05</td>
<td>68.83</td>
<td>70.53</td>
<td>65.58</td>
<td>69.38</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_treat</td>
<td>56.38</td>
<td>55.03</td>
<td>57.30</td>
<td>64.03</td>
<td>59.35</td>
</tr>
<tr>
<td style="text-align: left;">rarearena_rdc</td>
<td>81.80</td>
<td>66.40</td>
<td>76.20</td>
<td>89.60</td>
<td>88.40</td>
</tr>
<tr>
<td style="text-align: left;">rarearena_rds</td>
<td>54.00</td>
<td>42.60</td>
<td>49.80</td>
<td>59.80</td>
<td>57.20</td>
</tr>
<tr>
<td style="text-align: left;">rarebench</td>
<td>59.60</td>
<td>52.80</td>
<td>60.60</td>
<td>65.30</td>
<td>62.80</td>
</tr>
<tr>
<td rowspan="10" style="vertical-align: middle;">Exams</td>
<td style="text-align: left;">cmexam</td>
<td>80.10</td>
<td>77.70</td>
<td>82.70</td>
<td>77.50</td>
<td>78.00</td>
</tr>
<tr>
<td style="text-align: left;">Pediatric Qualification Exam</td>
<td>78.48</td>
<td>74.68</td>
<td>84.81</td>
<td>76.58</td>
<td>78.48</td>
</tr>
<tr>
<td style="text-align: left;">Internal Medicine Qualification Exam</td>
<td>83.42</td>
<td>86.10</td>
<td>87.17</td>
<td>87.70</td>
<td>83.42</td>
</tr>
<tr>
<td style="text-align: left;">General Practice Qualification Exam</td>
<td>87.07</td>
<td>88.44</td>
<td>88.44</td>
<td>81.63</td>
<td>84.35</td>
</tr>
<tr>
<td style="text-align: left;">USMLE</td>
<td>78.00</td>
<td>67.20</td>
<td>76.70</td>
<td>85.90</td>
<td>87.10</td>
</tr>
<tr>
<td style="text-align: left;">medbullets</td>
<td>66.88</td>
<td>54.22</td>
<td>64.29</td>
<td>72.40</td>
<td>75.97</td>
</tr>
<tr>
<td style="text-align: left;">mediq</td>
<td>83.40</td>
<td>66.80</td>
<td>79.90</td>
<td>88.80</td>
<td>90.20</td>
</tr>
<tr>
<td style="text-align: left;">nejmqa</td>
<td>49.75</td>
<td>45.69</td>
<td>50.76</td>
<td>69.54</td>
<td>54.31</td>
</tr>
<tr>
<td style="text-align: left;">pubmedqa</td>
<td>75.20</td>
<td>76.40</td>
<td>75.60</td>
<td>77.00</td>
<td>77.60</td>
</tr>
<tr>
<td style="text-align: left;">redisqa</td>
<td>74.50</td>
<td>69.70</td>
<td>75.00</td>
<td>83.20</td>
<td>82.80</td>
</tr>
<tr>
<td rowspan="5" style="vertical-align: middle;">Basic Capabilities</td>
<td style="text-align: left;">mednli_dis</td>
<td>80.40</td>
<td>68.90</td>
<td>74.90</td>
<td>58.30</td>
<td>79.80</td>
</tr>
<tr>
<td style="text-align: left;">medcalc</td>
<td>56.00</td>
<td>31.40</td>
<td>37.90</td>
<td>52.60</td>
<td>49.00</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-anatomy</td>
<td>80.00</td>
<td>67.41</td>
<td>71.11</td>
<td>86.67</td>
<td>91.11</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-virology</td>
<td>54.82</td>
<td>56.02</td>
<td>53.01</td>
<td>54.22</td>
<td>57.23</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-genetics</td>
<td>91.00</td>
<td>82.00</td>
<td>87.00</td>
<td>97.00</td>
<td>95.00</td>
</tr>
</tbody>
</table>
---
<a name="quick"></a>
# 🚀 Quick Start
### 🤗 Hugging Face Transformers
We recommend using the latest version of the Transformers library (at least 4.47.0). The following code snippet demonstrates how to use the **Baichuan-M1-14B-Instruct** model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load the pre-trained model and tokenizer
model_name = "baichuan-inc/Baichuan-M1-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# 2. Input prompt text
prompt = "May I ask you some questions about medical knowledge?"

# 3. Encode the input text for the model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# 4. Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# 5. Decode the generated text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# 6. Output the result
print("Generated text:")
print(response)
```
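To watch tokens as they are produced, a streamer can be attached to the same setup. This is a small optional addition that reuses `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```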
---
<a name="declare"></a>
# 📜 License and Statement
The use of the model must comply with the [《Baichuan-M1-14B模型社区许可协议》](https://github.com/baichuan-inc/Baichuan-M1-14B/blob/main/Baichuan-M1-14B模型社区许可协议.pdf) (Baichuan-M1-14B Model Community License Agreement).
The Baichuan development team has not developed any commercial applications based on this model. All users must comply with applicable laws and regulations and must not use the model for purposes that endanger national security or for any other illegal activity.
---
<a name="reference"></a>
# 🏷️ Reference
If you need to cite our work, please use the following reference:
```
@article{baichuan-m1-2025,
  title={Baichuan-M1: Pushing the Medical Capability of Large Language Models},
  author={Bingning Wang and Haizhou Zhao and Huozhi Zhou and Liang Song and Mingyu Xu and Wei Cheng and Xiangrong Zeng and Yupeng Zhang and Yuqi Huo and Zecheng Wang and Zhengyun Zhao and others},
  journal={arXiv preprint arXiv:2502.12671},
  year={2025}
}
```

56
config.json Normal file

@@ -0,0 +1,56 @@
{
"architectures": [
"BaichuanM1ForCausalLM"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_baichuan.BaichuanM1Config",
"AutoModelForCausalLM": "modeling_baichuan.BaichuanM1ForCausalLM"
},
"bos_token_id": 1,
"conv_window": 2,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 32768,
"model_max_length": 32768,
"model_type": "baichuan_m1",
"num_attention_heads": 20,
"num_hidden_layers": 40,
"num_key_value_heads": 2,
"num_swa_attention_heads": 40,
"num_swa_key_value_heads": 8,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 8192,
"sliding_window_layers": [
1,
3,
5,
7,
9,
11,
13,
15,
17,
19,
21,
23,
25,
27,
29,
31,
33,
35,
37,
39
],
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.48.1",
"use_cache": true,
"vocab_size": 133120
}
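Reading this file makes the hybrid attention layout explicit: the layers listed in `sliding_window_layers` (every odd-indexed layer) are presumably the ones that use the `num_swa_*` head counts with the 8192-token window, while the remaining layers use full attention with `num_attention_heads`/`num_key_value_heads`; the exact mapping lives in `modeling_baichuan.py`. A small sketch that assumes the file is available locally:

```python
import json

# List which layers use sliding-window attention and which use full attention.
with open("config.json") as f:
    cfg = json.load(f)

swa_layers = set(cfg["sliding_window_layers"])
for layer_idx in range(cfg["num_hidden_layers"]):
    kind = f"sliding-window ({cfg['sliding_window']})" if layer_idx in swa_layers else "full attention"
    print(f"layer {layer_idx:2d}: {kind}")
```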

119
configuration_baichuan.py Normal file

@@ -0,0 +1,119 @@
# coding=utf-8
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
class BaichuanM1Config(PretrainedConfig):
r"""
Configuration objects inherit from [`PretrainedConfig`] and control the behavior of model outputs. For more details,
refer to the documentation of [`PretrainedConfig`].
Args:
vocab_size (`int`, *optional*, defaults to 133120):
The size of the vocabulary used by the model.
hidden_size (`int`, *optional*, defaults to 5120):
The dimensionality of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 17408):
The dimensionality of the intermediate (MLP) representations.
num_hidden_layers (`int`, *optional*, defaults to 40):
The number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 40):
The number of attention heads for each attention layer in the Transformer encoder.
num_key_value_heads (`int`, *optional*, defaults to 2):
The number of key-value heads used to implement Grouped Query Attention (GQA).
- If `num_key_value_heads == num_attention_heads`, the model uses Multi-Head Attention (MHA).
- If `num_key_value_heads == 1`, the model uses Multi-Query Attention (MQA).
- Otherwise, the model uses Grouped Query Attention (GQA).
When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value heads are constructed
by mean-pooling the original heads within that group. For more details, refer to [this paper](https://arxiv.org/pdf/2305.13245.pdf).
If not specified, this defaults to `num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (either a string or a callable function) used in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 32768):
The maximum sequence length the model can handle.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated normal initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon value used by the RMS normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether the model should return the last key/value attentions. This is only relevant if `config.is_decoder=True`.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie the model's input and output word embeddings.
rope_theta (`float`, *optional*, defaults to 100000.0):
The base period of the Rotary Position Embeddings (RoPE).
use_sliding_window (`bool`, *optional*, defaults to `False`):
Whether to enable sliding window attention.
sliding_window (`int`, *optional*, defaults to 2048):
The size of the sliding window for sliding window attention (SWA).
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio applied to the attention probabilities.
"""
model_type = "baichuan"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=133120,
hidden_size=5120,
intermediate_size=17408,
num_hidden_layers=40,
num_attention_heads=40,
num_key_value_heads=2,
num_swa_attention_heads: int = 20,
num_swa_key_value_heads=8,
sliding_window_layers: list = None,
hidden_act="silu",
max_position_embeddings=32768,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
tie_word_embeddings=False,
rope_theta=100000.0,
sliding_window=2048,
attention_dropout=0.0,
conv_window = 2,
**kwargs,
):
self.sliding_window_layers = sliding_window_layers
self.num_swa_key_value_heads = num_swa_key_value_heads
self.num_swa_attention_heads = num_swa_attention_heads
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.sliding_window = sliding_window
# for backward compatibility
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.attention_dropout = attention_dropout
self.conv_window = conv_window
super().__init__(
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)

14
generation_config.json Normal file

@@ -0,0 +1,14 @@
{
"assistant_token_id": 74,
"bos_token_id": 1,
"do_sample": true,
"eos_token_id": 2,
"max_new_tokens": 2048,
"pad_token_id": 0,
"repetition_penalty": 1.05,
"temperature": 0.3,
"top_k": 5,
"top_p": 0.85,
"transformers_version": "4.48.1",
"user_token_id": 73
}
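These sampling defaults are picked up automatically when the model is loaded from the repository; if you want to construct them explicitly, an equivalent `GenerationConfig` looks like the following sketch:

```python
from transformers import GenerationConfig

# Same sampling defaults as the file above, built by hand.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    top_k=5,
    top_p=0.85,
    repetition_penalty=1.05,
    max_new_tokens=2048,
)
# outputs = model.generate(**model_inputs, generation_config=gen_config)
```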

BIN
model-00001-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00002-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00004-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00005-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00006-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

370
model.safetensors.index.json Normal file

@@ -0,0 +1,370 @@
{
"metadata": {
"total_size": 28941528640
},
"weight_map": {
"lm_head.weight": "model-00006-of-00006.safetensors",
"model.embed_tokens.weight": "model-00001-of-00006.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.10.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.13.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.13.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.13.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.20.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.20.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.28.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.28.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.28.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.28.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.30.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.35.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.35.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.35.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.35.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.35.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.35.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.36.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.4.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.5.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.5.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.5.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.5.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.5.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.norm.weight": "model-00006-of-00006.safetensors"
}
}
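The index maps every tensor name to the shard that stores it, which makes it easy to inspect the checkpoint layout without loading any weights. A small sketch, assuming the repository files are available locally:

```python
import json

# Look up the shard that holds a given tensor before loading anything.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"] / 1e9, "GB of weights")          # ~28.9 GB
print(index["weight_map"]["model.layers.0.self_attn.conv_k"])          # shard filename
```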

1197
modeling_baichuan.py Normal file

File diff suppressed because it is too large

46
special_tokens_map.json Normal file

@@ -0,0 +1,46 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>",
"<B_SYS>",
"<B_USYS>",
"<C_Q>",
"<C_A>",
"<|im_sep|>",
"<|tool_call|>",
"<|arguments|>"
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"pad_token": "<pad>",
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}

231
tokenization_baichuan.py Normal file

@@ -0,0 +1,231 @@
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple
import sentencepiece as spm
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
from transformers.utils import logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {},
"tokenizer_file": {},
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
class BaichuanTokenizer(PreTrainedTokenizer):
"""
Construct a Baichuan tokenizer based on SentencePiece.
Args:
vocab_file (`str`):
Path to the vocabulary file.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
vocab_file,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token=None,
sp_model_kwargs: Optional[Dict[str, Any]] = None,
add_bos_token=True,
add_eos_token=False,
clean_up_tokenization_spaces=False,
**kwargs,
):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
self.vocab_file = vocab_file
self.add_bos_token = add_bos_token
self.add_eos_token = add_eos_token
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
add_bos_token=add_bos_token,
add_eos_token=add_eos_token,
sp_model_kwargs=self.sp_model_kwargs,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
self.pad_token_id = self._convert_token_to_id(self.pad_token)
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
return state
def __setstate__(self, d):
self.__dict__ = d
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(self.vocab_file)
@property
def vocab_size(self):
"""Returns vocab size"""
return self.sp_model.get_piece_size()
def get_vocab(self):
"""Returns vocab as a dict"""
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
vocab.update(self.added_tokens_encoder)
return vocab
def _tokenize(self, text):
"""Returns a tokenized string."""
return self.sp_model.encode(text, out_type=str)
def _convert_token_to_id(self, token):
"""Converts a token (str) to an id using the vocab."""
return self.sp_model.piece_to_id(token)
def _convert_id_to_token(self, index):
"""Converts an index (integer) to a token (str) using the vocab."""
token = self.sp_model.IdToPiece(index)
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (strings) into a single string."""
current_sub_tokens = []
out_string = ""
prev_is_special = False
for i, token in enumerate(tokens):
# make sure that special tokens are not decoded using sentencepiece model
if token in self.all_special_tokens:
if not prev_is_special and i != 0:
out_string += " "
out_string += self.sp_model.decode(current_sub_tokens) + token
prev_is_special = True
current_sub_tokens = []
else:
current_sub_tokens.append(token)
prev_is_special = False
out_string += self.sp_model.decode(current_sub_tokens)
return out_string
def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
"""
Save the vocabulary and special tokens file to a directory.
Args:
save_directory (`str`):
The directory in which to save the vocabulary.
Returns:
`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
copyfile(self.vocab_file, out_vocab_file)
elif not os.path.isfile(self.vocab_file):
with open(out_vocab_file, "wb") as fi:
content_spiece_model = self.sp_model.serialized_model_proto()
fi.write(content_spiece_model)
return (out_vocab_file,)
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
bos_token_id = [self.bos_token_id] if self.add_bos_token else []
eos_token_id = [self.eos_token_id] if self.add_eos_token else []
output = bos_token_id + token_ids_0 + eos_token_id
if token_ids_1 is not None:
output = output + bos_token_id + token_ids_1 + eos_token_id
return output
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
bos_token_id = [1] if self.add_bos_token else []
eos_token_id = [1] if self.add_eos_token else []
if token_ids_1 is None:
return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
return (
bos_token_id
+ ([0] * len(token_ids_0))
+ eos_token_id
+ bos_token_id
+ ([0] * len(token_ids_1))
+ eos_token_id
)
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
sequence pair mask has the following format:
```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
if token_ids_1 is None, only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
bos_token_id = [self.bos_token_id] if self.add_bos_token else []
eos_token_id = [self.eos_token_id] if self.add_eos_token else []
output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
if token_ids_1 is not None:
output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
return output
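A quick sketch of exercising the tokenizer defined above; it assumes the repository has been downloaded and the tokenizer loaded as `tokenizer`, for example via `AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)` as in the README's quick start:

```python
text = "What are the common symptoms of influenza?"
ids = tokenizer(text)["input_ids"]
print(len(ids), "tokens")
print(tokenizer.convert_ids_to_tokens(ids)[:10])   # SentencePiece pieces
print(tokenizer.decode(ids))                       # decodes back to (approximately) the input text
```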

BIN
tokenizer.model (Stored with Git LFS) Normal file

Binary file not shown.

389
tokenizer_config.json Normal file

@@ -0,0 +1,389 @@
{
"add_bos_token": false,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<pad>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"50": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"51": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"52": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"53": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"54": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"55": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"56": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"57": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"58": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"59": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"60": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"61": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"62": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"63": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"64": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"65": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"66": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"67": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"68": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"69": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"70": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"71": {
"content": "<B_SYS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"72": {
"content": "<B_USYS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73": {
"content": "<C_Q>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"74": {
"content": "<C_A>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"75": {
"content": "<B_FUNC>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"76": {
"content": "<B_CODE>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"77": {
"content": "<B_APE>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"78": {
"content": "<function_calling>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"79": {
"content": "<calc_start>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"80": {
"content": "<calc_end>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"81": {
"content": "<inner_think>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"82": {
"content": "<|im_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"83": {
"content": "<|tool_call|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"84": {
"content": "<|arguments|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"85": {
"content": "<|o1_step|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"86": {
"content": "<|o1_answer|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"87": {
"content": "<tree_node>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"88": {
"content": "</tree_node>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>",
"<B_SYS>",
"<B_USYS>",
"<C_Q>",
"<C_A>",
"<|im_sep|>",
"<|tool_call|>",
"<|arguments|>"
],
"auto_map": {
"AutoTokenizer": [
"tokenization_baichuan.BaichuanTokenizer",
null
]
},
"bos_token": "<s>",
"chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{'<B_SYS>' + message['content']}}{% elif message['role'] == 'user_system' %}{{'<B_USYS>' + message['content']}}{% elif message['role'] == 'user' %}{{'<C_Q>' + message['content']}}{% elif message['role'] == 'assistant' %}{{'<C_A>' + message['content']}}{% elif message['role'] == 'function' %}{{'<B_FUNC>' + message['content']}}{% elif message['role'] == 'code' %}{{'<B_CODE>' + message['content']}}{% else %}{{ raise_exception('Invalid message role: ' + message['role']) }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<C_A>'}}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "</s>",
"extra_special_tokens": {},
"model_max_length": 32768,
"pad_token": "<pad>",
"sp_model_kwargs": {},
"tokenizer_class": "BaichuanTokenizer",
"unk_token": "<unk>",
"use_fast": false
}
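The `chat_template` above wraps each role in its control token (`<B_SYS>` for system, `<B_USYS>` for user_system, `<C_Q>` for user, `<C_A>` for assistant, `<B_FUNC>` for function, `<B_CODE>` for code) and, with `add_generation_prompt=True`, appends the assistant prefix `<C_A>`. A short sketch of the resulting prompt string, assuming the tokenizer has been loaded as `tokenizer` as in the README's quick start:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "May I ask you some questions about medical knowledge?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <B_SYS>You are a helpful assistant.<C_Q>May I ask you some questions about medical knowledge?<C_A>
```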