diff --git a/README.md b/README.md
index 64f8400..352c8e3 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,399 @@
-# Baichuan-M1-14B-Instruct
+---
+language:
+- en
+- zh
+tags:
+- medical
+---
+<div align="center">
+<h1>
+  Baichuan-M1-14B-Instruct
+</h1>
+</div>
 
-Baichuan-M1-14B-Instruct
\ No newline at end of file
+<p align="center">
+🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base" target="_blank">Baichuan-M1-14B-Base</a> • 🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct" target="_blank">Baichuan-M1-14B-Instruct</a> • 📗 <a href="https://arxiv.org/abs/2502.12671" target="_blank">Technical Report</a> • 💬 <a href="https://y41.8if.cn/JQCj6n" target="_blank">WeChat</a>
+</p>
+
+
+---
+
+# 📖 Table of Contents
+
+- [🏁 Model Introduction](#intro)
+- [🔬 Data Collection and Processing](#data)
+- [🧠 New Model Architecture](#structure)
+- [⚙️ Training Methodology](#training)
+- [📊 Benchmark Results](#benchmark)
+- [🚀 Quick Start](#quick)
+- [📜 License and Statement](#declare)
+- [🏷️ Reference](#reference)
+
+---
+<a name="intro"></a>
+# 🏁 Model Introduction
+
+**Baichuan-M1-14B** is the industry's first open-source large language model developed from scratch and specifically optimized for medical scenarios, built by Baichuan Intelligence. It performs on par with models of similar size on most general benchmarks while outperforming models five times larger in medical scenarios. Below are the core features of the model:
+
+- Trained from scratch on **20 trillion tokens** of high-quality medical and general data.
+- Specialized modeling for **20+ medical departments** with fine-grained medical expertise.
+- Introduces **innovative model architecture**, significantly improving context understanding and long-sequence task performance.
+- Provides **[🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base)** and **[🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct)**.
+
+
+---
+<a name="data"></a>
+# 🔬 Data Collection and Processing
+
+## Medical Data Collection
+
+We conducted meticulous data collection and synthesis for the medical field, including:
+
+- **Tens of millions of professional medical data items**: Chinese/English professional papers, medical cases, medical textbooks, knowledge bases, etc.
+- **Hundreds of millions of medical Q&A and clinical data**: Covering complex medical reasoning and real-world clinical cases.
+- **Comprehensive data classification and evaluation**: Categorized by medical departments, content, and value to ensure balanced data distribution and filter out truly valuable medical data.
+
+## Data Synthesis and Optimization
+
+- **Synthetic data design**: Combining knowledge graphs, cases, and textbooks to generate diverse, high-quality medical reasoning data.
+- **Self-reflection mechanism and reward model**: Continuously improving the quality of synthetic data, ultimately generating **nearly a trillion tokens** of reasoning data, covering long-tail knowledge and complex scenarios.
+
+
+## General Data Collection
+
+- **20T multilingual general dataset**: Including 14T English data, 4T Chinese data, and 2T data covering 30 mainstream languages.
+- **Deduplication and upsampling strategy**: Upsampling high-quality data to significantly enhance model performance.
+- **27 global knowledge categories**: Optimizing data ratios based on small model experiments to balance general and domain-specific capabilities.
+
+---
+<a name="structure"></a>
+# 🧠 New Model Architecture
+
+## Short Convolution Attention Mechanism
+
+- By introducing lightweight short convolution operations when computing Key and Value, the reliance of standard Transformer models on induction heads for learning is significantly reduced. Traditional Transformers rely on induction heads to capture repetitive patterns and contextual dependencies in sequences, which requires a certain model width and depth. Short convolution decouples the Key and Value sequences in the time dimension, enhancing context learning capabilities. Extensive experiments from toy models to models with over ten billion parameters show that the short convolution attention mechanism excels in language modeling tasks, especially those heavily dependent on contextual information.
+
+
+## Sliding Window Attention Mechanism
+
+- Adopting a sliding window attention mechanism in some layers to reduce KV Cache memory usage.
+- **Optimization**: Balancing computational efficiency and performance, especially suitable for long-sequence tasks.
+
+## Optimizing Position Encoding Oscillation
+
+- By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
+- **Result**: More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
+
+## High Peak Learning Rate Strategy
+
+- Using a **WSD (warmup-stable-decay) learning rate schedule** with a high peak learning rate to promote model generalization.
+- **Comparison results**: Significant improvement in benchmark task performance.
+
+## Adaptive Gradient Update
+
+- **Dynamic gradient clipping**: Skipping the update when the gradient is too large, which reduces instability caused by unusual samples or steep regions of the loss landscape.
+
+---
+<a name="training"></a>
+# ⚙️ Training Methodology
+
+We adopted a **multi-stage curriculum learning and alignment optimization** approach that systematically enhances model capabilities in two parts:
+
+## 1. Multi-Stage Curriculum Learning
+
+Training is divided into three stages, progressively optimizing the model's general and medical domain capabilities:
+
+1. **General Knowledge Enhancement Stage**: Focused on general language modeling to improve basic language and common sense.
+2. **Medical Basic Knowledge Enhancement Stage**: Introducing high-quality medical data to enhance reasoning, mathematical, and medical knowledge.
+3. **Medical Advanced Knowledge Enhancement Stage**: Further optimizing data quality, focusing on complex medical reasoning, disease diagnosis, and long-tail knowledge.
+
+## 2. Alignment Optimization
+
+Enhancing model generation quality, logical reasoning, and user preference alignment through reinforcement learning and pairwise data optimization:
+
+1. **Pairwise Data**: Covering multi-turn dialogues, instruction following, math and code, and reasoning tasks, sourced from human annotations and multi-model generation.
+2. **Optimization Process**:
+   - **ELO**: Optimizing diverse, high-quality chain-of-thought generation based on maximum likelihood.
+   - **TDPO**: Using pairwise data to optimize the generation model for better user preference alignment.
+   - **PPO**: Further enhancing generation logic and task performance through policy optimization.
+     
+
+This combination of multi-stage curriculum learning and alignment optimization enables the model to perform strongly in both general and medical domains.
+
+---
+<a name="benchmark"></a>
+# 📊 Benchmark Results
+
+Our evaluation covers mainstream medical benchmarks, including both open-source and closed-source evaluations. Baichuan-M1-14B-Instruct achieves the highest average score among the open-source models compared (72.23), ahead of Qwen2.5-72B-Instruct (70.51) and behind claude-3.5-sonnet-20241022 (74.85) and gpt-4o (75.00).
+
+<table style="border: 1px solid #000; border-collapse: collapse; width: 100%; text-align: center;">
+    <thead>
+        <tr>
+            <th>Category</th>
+            <th>Benchmark</th>
+            <th style="font-size:15px;">Baichuan-M1-14B-Instruct</th>
+            <th style="font-size:15px;">Qwen2.5-14B-Instruct</th>
+            <th style="font-size:15px;">Qwen2.5-72B-Instruct</th>
+            <th style="font-size:15px;">claude-3.5-sonnet-20241022</th>
+            <th style="font-size:15px;">gpt-4o</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td colspan="2" style="text-align: center;">Average Score</td>
+            <td>72.23</td>
+            <td>65.39</td>
+            <td>70.51</td>
+            <td>74.85</td>
+            <td>75.00</td>
+        </tr>
+        <tr>
+            <td rowspan="7" style="vertical-align: middle;">Clinical Practice</td>
+            <td style="text-align: left;">cmbclin</td>
+            <td>77.40</td>
+            <td>71.51</td>
+            <td>75.36</td>
+            <td>78.37</td>
+            <td>75.36</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">clinicalbench_diag</td>
+            <td>70.90</td>
+            <td>68.85</td>
+            <td>72.23</td>
+            <td>75.00</td>
+            <td>73.05</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">clinicalbench_hos</td>
+            <td>70.05</td>
+            <td>68.83</td>
+            <td>70.53</td>
+            <td>65.58</td>
+            <td>69.38</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">clinicalbench_treat</td>
+            <td>56.38</td>
+            <td>55.03</td>
+            <td>57.30</td>
+            <td>64.03</td>
+            <td>59.35</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">rarearena_rdc</td>
+            <td>81.80</td>
+            <td>66.40</td>
+            <td>76.20</td>
+            <td>89.60</td>
+            <td>88.40</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">rarearena_rds</td>
+            <td>54.00</td>
+            <td>42.60</td>
+            <td>49.80</td>
+            <td>59.80</td>
+            <td>57.20</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">rarebench</td>
+            <td>59.60</td>
+            <td>52.80</td>
+            <td>60.60</td>
+            <td>65.30</td>
+            <td>62.80</td>
+        </tr>
+        <tr>
+            <td rowspan="10" style="vertical-align: middle;">Exams</td>
+            <td style="text-align: left;">cmexam</td>
+            <td>80.10</td>
+            <td>77.70</td>
+            <td>82.70</td>
+            <td>77.50</td>
+            <td>78.00</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">Pediatric Qualification Exam</td>
+            <td>78.48</td>
+            <td>74.68</td>
+            <td>84.81</td>
+            <td>76.58</td>
+            <td>78.48</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">Internal Medicine Qualification Exam</td>
+            <td>83.42</td>
+            <td>86.10</td>
+            <td>87.17</td>
+            <td>87.70</td>
+            <td>83.42</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">General Practice Qualification Exam</td>
+            <td>87.07</td>
+            <td>88.44</td>
+            <td>88.44</td>
+            <td>81.63</td>
+            <td>84.35</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">USMLE</td>
+            <td>78.00</td>
+            <td>67.20</td>
+            <td>76.70</td>
+            <td>85.90</td>
+            <td>87.10</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">medbullets</td>
+            <td>66.88</td>
+            <td>54.22</td>
+            <td>64.29</td>
+            <td>72.40</td>
+            <td>75.97</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">mediq</td>
+            <td>83.40</td>
+            <td>66.80</td>
+            <td>79.90</td>
+            <td>88.80</td>
+            <td>90.20</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">nejmqa</td>
+            <td>49.75</td>
+            <td>45.69</td>
+            <td>50.76</td>
+            <td>69.54</td>
+            <td>54.31</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">pubmedqa</td>
+            <td>75.20</td>
+            <td>76.40</td>
+            <td>75.60</td>
+            <td>77.00</td>
+            <td>77.60</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">redisqa</td>
+            <td>74.50</td>
+            <td>69.70</td>
+            <td>75.00</td>
+            <td>83.20</td>
+            <td>82.80</td>
+        </tr>
+        <tr>
+            <td rowspan="5" style="vertical-align: middle;">Basic Capabilities</td>
+            <td style="text-align: left;">mednli_dis</td>
+            <td>80.40</td>
+            <td>68.90</td>
+            <td>74.90</td>
+            <td>58.30</td>
+            <td>79.80</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">medcalc</td>
+            <td>56.00</td>
+            <td>31.40</td>
+            <td>37.90</td>
+            <td>52.60</td>
+            <td>49.00</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">MMLU-anatomy</td>
+            <td>80.00</td>
+            <td>67.41</td>
+            <td>71.11</td>
+            <td>86.67</td>
+            <td>91.11</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">MMLU-virology</td>
+            <td>54.82</td>
+            <td>56.02</td>
+            <td>53.01</td>
+            <td>54.22</td>
+            <td>57.23</td>
+        </tr>
+        <tr>
+            <td style="text-align: left;">MMLU-genetics</td>
+            <td>91.00</td>
+            <td>82.00</td>
+            <td>87.00</td>
+            <td>97.00</td>
+            <td>95.00</td>
+        </tr>
+    </tbody>
+</table>
+
+
+---
+<a name="quick"></a>
+# 🚀 Quick Start
+
+### 🤗 Hugging Face Transformers
+
+We recommend using the latest version of the Transformers library (at least 4.47.0). The following code snippet demonstrates how to use the **Baichuan-M1-14B-Instruct** model:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# 1. Load pre-trained model and tokenizer
+model_name = "baichuan-inc/Baichuan-M1-14B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
+# 2. Input prompt text
+prompt = "May I ask you some questions about medical knowledge?"
+
+# 3. Encode the input text for the model
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# 4. Generate text
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=512
+)
+generated_ids = [
+    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+]
+
+# 5. Decode the generated text
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+
+# 6. Output the result
+print("Generated text:")
+print(response)
+```
+
+---
+<a name="declare"></a>
+# 📜 License and Statement
+The use of the model must comply with the [Baichuan-M1-14B Model Community License Agreement](https://github.com/baichuan-inc/Baichuan-M1-14B/blob/main/Baichuan-M1-14B模型社区许可协议.pdf).
+
+The Baichuan development team has not developed any commercial applications based on this model. All users must comply with applicable laws and regulations and must not use the model for purposes that harm national security or are otherwise illegal.
+
+---
+<a name="reference"></a>
+# 🏷️ Reference
+If you need to cite our work, please use the following reference:
+```
+@article{baichuan-m1-2025,
+  title={Baichuan-M1: Pushing the Medical Capability of Large Language Models},
+  author={Bingning Wang and Haizhou Zhao and Huozhi Zhou and Liang Song and Mingyu Xu and Wei Cheng and Xiangrong Zeng and Yupeng Zhang and Yuqi Huo and Zecheng Wang and Zhengyun Zhao and others},
+  journal={arXiv preprint arXiv:2502.12671},
+  year={2025}
+}
+```
\ No newline at end of file
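The "Short Convolution Attention Mechanism" section of the README above applies a lightweight convolution to the Key and Value sequences before attention, and `config.json` sets `conv_window` to 2. The following is a minimal, dependency-free sketch of what a causal convolution of width 2 over a sequence of vectors could look like. The function name `short_causal_conv`, the single scalar weight pair, and the zero padding before the first token are illustrative assumptions; the actual layout of `conv_k` and `conv_v` in `modeling_baichuan.py` is not shown in this diff.

```python
def short_causal_conv(seq, weights):
    """Mix each time step with the previous one (causal window of 2).

    seq: list of equal-length vectors, one per time step.
    weights: (w_prev, w_cur), hypothetical scalar taps for this sketch.
    """
    if not seq:
        return []
    w_prev, w_cur = weights
    prev = [0.0] * len(seq[0])  # assumed zero padding before the first token
    out = []
    for vec in seq:
        # out[t] = w_prev * seq[t - 1] + w_cur * seq[t]
        out.append([w_prev * p + w_cur * c for p, c in zip(prev, vec)])
        prev = vec
    return out


keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed_keys = short_causal_conv(keys, (0.5, 0.5))
```

With weights `(0.0, 1.0)` the function returns its input unchanged, so a layer of this shape can fall back to standard attention keys and values.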
diff --git a/config.json b/config.json
new file mode 100644
index 0000000..eb8aa70
--- /dev/null
+++ b/config.json
@@ -0,0 +1,56 @@
+{
+  "architectures": [
+    "BaichuanM1ForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_baichuan.BaichuanM1Config",
+    "AutoModelForCausalLM": "modeling_baichuan.BaichuanM1ForCausalLM"
+  },
+  "bos_token_id": 1,
+  "conv_window": 2,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 5120,
+  "initializer_range": 0.02,
+  "intermediate_size": 17408,
+  "max_position_embeddings": 32768,
+  "model_max_length": 32768,
+  "model_type": "baichuan_m1",
+  "num_attention_heads": 20,
+  "num_hidden_layers": 40,
+  "num_key_value_heads": 2,
+  "num_swa_attention_heads": 40,
+  "num_swa_key_value_heads": 8,
+  "pad_token_id": 0,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 1000000.0,
+  "sliding_window": 8192,
+  "sliding_window_layers": [
+    1,
+    3,
+    5,
+    7,
+    9,
+    11,
+    13,
+    15,
+    17,
+    19,
+    21,
+    23,
+    25,
+    27,
+    29,
+    31,
+    33,
+    35,
+    37,
+    39
+  ],
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.1",
+  "use_cache": true,
+  "vocab_size": 133120
+}
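The README says sliding window attention is used in some layers to reduce KV cache memory, and the `config.json` above gives the concrete head counts and window size. The sketch below estimates the K/V cache size for one sequence from those values. It rests on assumptions that this diff does not confirm: that the head dimension is `hidden_size` divided by the layer's attention head count, that sliding window layers cache at most `sliding_window` tokens, and that the short convolution state is small enough to ignore. The helper name `kv_cache_bytes` is hypothetical.

```python
# Values copied from config.json in this diff.
CONFIG = {
    "hidden_size": 5120,
    "num_hidden_layers": 40,
    "num_attention_heads": 20,
    "num_key_value_heads": 2,
    "num_swa_attention_heads": 40,
    "num_swa_key_value_heads": 8,
    "sliding_window": 8192,
    "sliding_window_layers": list(range(1, 40, 2)),  # 1, 3, ..., 39
}
BYTES_PER_VALUE = 2  # bfloat16, per "torch_dtype"


def kv_cache_bytes(cfg, seq_len):
    """Estimate the total K/V cache size in bytes for one sequence."""
    swa_layers = set(cfg["sliding_window_layers"])
    total = 0
    for layer in range(cfg["num_hidden_layers"]):
        if layer in swa_layers:
            heads = cfg["num_swa_attention_heads"]
            kv_heads = cfg["num_swa_key_value_heads"]
            cached_tokens = min(seq_len, cfg["sliding_window"])  # assumed cap
        else:
            heads = cfg["num_attention_heads"]
            kv_heads = cfg["num_key_value_heads"]
            cached_tokens = seq_len
        head_dim = cfg["hidden_size"] // heads  # assumed derivation
        # Factor 2 accounts for storing both keys and values.
        total += 2 * kv_heads * head_dim * cached_tokens * BYTES_PER_VALUE
    return total


full_context = kv_cache_bytes(CONFIG, 32768)
print(f"{full_context / 2**30:.3f} GiB at 32768 tokens")
```

Under these assumptions the full-attention layers use 20 heads of dimension 256 and the sliding window layers use 40 heads of dimension 128, which is consistent with the README's note about increasing the dimension of some attention heads.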
diff --git a/configuration_baichuan.py b/configuration_baichuan.py
new file mode 100644
index 0000000..91e9e40
--- /dev/null
+++ b/configuration_baichuan.py
@@ -0,0 +1,119 @@
+# coding=utf-8
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from transformers import PretrainedConfig
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class BaichuanM1Config(PretrainedConfig):
+    r"""
+    Configuration objects inherit from [`PretrainedConfig`] and control the behavior of model outputs. For more details, 
+    refer to the documentation of [`PretrainedConfig`].
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 133120):
+            The size of the vocabulary used by the model.
+        hidden_size (`int`, *optional*, defaults to 5120):
+            The dimensionality of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 17408):
+            The dimensionality of the intermediate (MLP) representations.
+        num_hidden_layers (`int`, *optional*, defaults to 40):
+            The number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 40):
+            The number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*, defaults to 2):
+            The number of key-value heads used to implement Grouped Query Attention (GQA). 
+            - If `num_key_value_heads == num_attention_heads`, the model uses Multi-Head Attention (MHA).
+            - If `num_key_value_heads == 1`, the model uses Multi-Query Attention (MQA).
+            - Otherwise, the model uses Grouped Query Attention (GQA). 
+            When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value heads are constructed 
+            by mean-pooling the original heads within that group. For more details, refer to [this paper](https://arxiv.org/pdf/2305.13245.pdf). 
+            If `None`, this falls back to `num_attention_heads`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (either a string or a callable function) used in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 32768):
+            The maximum sequence length the model can handle.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated normal initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon value used by the RMS normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether the model should return the last key/value attentions. This is only relevant if `config.is_decoder=True`.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether to tie the model's input and output word embeddings.
+        rope_theta (`float`, *optional*, defaults to 100000.0):
+            The base period of the Rotary Position Embeddings (RoPE).
+        sliding_window_layers (`list`, *optional*, defaults to `None`):
+            Indices of the layers that use sliding window attention (SWA).
+        sliding_window (`int`, *optional*, defaults to 2048):
+            The size of the sliding window for sliding window attention (SWA).
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio applied to the attention probabilities.
+    """
+
+    model_type = "baichuan_m1"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    def __init__(
+            self,
+            vocab_size=133120,
+            hidden_size=5120,
+            intermediate_size=17408,
+            num_hidden_layers=40,
+            num_attention_heads=40,
+            num_key_value_heads=2,
+            num_swa_attention_heads: int = 20,
+            num_swa_key_value_heads=8,
+            sliding_window_layers: list = None,
+            hidden_act="silu",
+            max_position_embeddings=32768,
+            initializer_range=0.02,
+            rms_norm_eps=1e-6,
+            use_cache=True,
+            tie_word_embeddings=False,
+            rope_theta=100000.0,
+            sliding_window=2048,
+            attention_dropout=0.0,
+            conv_window=2,
+            **kwargs,
+    ):
+        self.sliding_window_layers = sliding_window_layers
+        self.num_swa_key_value_heads = num_swa_key_value_heads
+        self.num_swa_attention_heads = num_swa_attention_heads
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.sliding_window = sliding_window
+
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.attention_dropout = attention_dropout
+        self.conv_window = conv_window
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
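Several constructor defaults in `BaichuanM1Config.__init__` above differ from the values shipped in `config.json`, so instantiating the config without the JSON file describes a different model. The sketch below copies both sets of values from this diff and lists the keys that differ. The helper name `overridden_keys` is hypothetical, and the comparison uses plain dictionaries rather than the `transformers` API.

```python
# Defaults copied from BaichuanM1Config.__init__ in this diff.
INIT_DEFAULTS = {
    "vocab_size": 133120,
    "hidden_size": 5120,
    "intermediate_size": 17408,
    "num_hidden_layers": 40,
    "num_attention_heads": 40,
    "num_key_value_heads": 2,
    "num_swa_attention_heads": 20,
    "num_swa_key_value_heads": 8,
    "sliding_window_layers": None,
    "hidden_act": "silu",
    "max_position_embeddings": 32768,
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-6,
    "use_cache": True,
    "tie_word_embeddings": False,
    "rope_theta": 100000.0,
    "sliding_window": 2048,
    "attention_dropout": 0.0,
    "conv_window": 2,
}

# Values copied from config.json in this diff; unlisted keys match the defaults.
CONFIG_JSON = dict(
    INIT_DEFAULTS,
    num_attention_heads=20,
    num_swa_attention_heads=40,
    rope_theta=1000000.0,
    sliding_window=8192,
    sliding_window_layers=list(range(1, 40, 2)),
)


def overridden_keys(defaults, loaded):
    """Return the sorted keys whose loaded value differs from the default."""
    return sorted(k for k in defaults if k in loaded and loaded[k] != defaults[k])


changed = overridden_keys(INIT_DEFAULTS, CONFIG_JSON)
for key in changed:
    print(f"{key}: default={INIT_DEFAULTS[key]!r} -> config.json={CONFIG_JSON[key]!r}")
```

Note that the two head-count defaults are swapped relative to `config.json`, and `rope_theta` differs by a factor of ten, so loading through `from_pretrained` with the shipped JSON matters for correct results.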
diff --git a/generation_config.json b/generation_config.json
new file mode 100644
index 0000000..d75874e
--- /dev/null
+++ b/generation_config.json
@@ -0,0 +1,14 @@
+{
+  "assistant_token_id": 74,
+  "bos_token_id": 1,
+  "do_sample": true,
+  "eos_token_id": 2,
+  "max_new_tokens": 2048,
+  "pad_token_id": 0,
+  "repetition_penalty": 1.05,
+  "temperature": 0.3,
+  "top_k": 5,
+  "top_p": 0.85,
+  "transformers_version": "4.48.1",
+  "user_token_id": 73
+}
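The `generation_config.json` above enables sampling with `temperature` 0.3, `top_k` 5, and `top_p` 0.85. As a rough illustration of how these three settings narrow the next-token distribution, here is a dependency-free sketch. It is a simplification and not the `transformers` implementation: `repetition_penalty` is omitted, the function name `filtered_distribution` is hypothetical, and the nucleus cut-off keeps the shortest prefix whose cumulative probability reaches `top_p`.

```python
import math


def filtered_distribution(logits, temperature=0.3, top_k=5, top_p=0.85):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # A temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring token ids (ties keep their original order).
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors, shifted by the maximum for numerical stability.
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


peaked = filtered_distribution([2.0, 1.0, 0.5, 0.0, -1.0, -2.0])
flat = filtered_distribution([0.0] * 7)
```

With the low temperature, a moderately peaked set of logits collapses to a single candidate, while a completely flat set keeps the five tokens allowed by `top_k`. A token could then be drawn with `random.choices(list(dist), weights=list(dist.values()))`.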
diff --git a/model-00001-of-00006.safetensors b/model-00001-of-00006.safetensors
new file mode 100644
index 0000000..7f82563
--- /dev/null
+++ b/model-00001-of-00006.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d4877c584176284bd6fef77e51fbd2ef96c8b38852a1f08ac2cdc8ea46179a4c
+size 4938901448
diff --git a/model-00002-of-00006.safetensors b/model-00002-of-00006.safetensors
new file mode 100644
index 0000000..d888303
--- /dev/null
+++ b/model-00002-of-00006.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d987caf8f98c6c4dedb8466840dd4fb76a615c7d831840e65d22276b33a66e85
+size 4938965032
diff --git a/model-00004-of-00006.safetensors b/model-00004-of-00006.safetensors
new file mode 100644
index 0000000..56f4efc
--- /dev/null
+++ b/model-00004-of-00006.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d0c855b72caab92ee27789d09240551e5d3df8e4389c7c6d19e009cba841c5ff
+size 4949450824
diff --git a/model-00005-of-00006.safetensors b/model-00005-of-00006.safetensors
new file mode 100644
index 0000000..650d7aa
--- /dev/null
+++ b/model-00005-of-00006.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d0a66bd47f70c4752c32b3ae05563c62ad03a8627da0345cdd8953d244330927
+size 4886515440
diff --git a/model-00006-of-00006.safetensors b/model-00006-of-00006.safetensors
new file mode 100644
index 0000000..999f489
--- /dev/null
+++ b/model-00006-of-00006.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:640fd7f8c27d6a8fb7c87b339d7f451075081c16b684f8b78d1dcb923a4c32f0
+size 4341222272
diff --git a/model.safetensors.index.json b/model.safetensors.index.json
new file mode 100644
index 0000000..2a6b952
--- /dev/null
+++ b/model.safetensors.index.json
@@ -0,0 +1,370 @@
+{
+  "metadata": {
+    "total_size": 28941528640
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00006-of-00006.safetensors",
+    "model.embed_tokens.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.0.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.0.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.1.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.1.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.10.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.10.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.11.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.11.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.12.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.12.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.13.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.13.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.14.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.14.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.15.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.15.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.16.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.16.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.17.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.17.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.18.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.18.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.19.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.19.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.2.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.2.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.20.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
+    "model.layers.20.self_attn.conv_k": "model-00003-of-00006.safetensors",
+    "model.layers.20.self_attn.conv_v": "model-00003-of-00006.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.21.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.21.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.22.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.22.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.23.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.23.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.24.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.24.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.25.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.25.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.26.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.26.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
+    "model.layers.27.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.27.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.28.self_attn.conv_k": "model-00004-of-00006.safetensors",
+    "model.layers.28.self_attn.conv_v": "model-00004-of-00006.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.29.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.29.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.3.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.3.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.30.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.30.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.31.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.31.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.32.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.32.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.33.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.33.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.34.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.34.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.35.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.35.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.35.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.35.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.35.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
+    "model.layers.35.self_attn.conv_k": "model-00005-of-00006.safetensors",
+    "model.layers.35.self_attn.conv_v": "model-00005-of-00006.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "model.layers.36.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
+    "model.layers.36.self_attn.conv_k": "model-00006-of-00006.safetensors",
+    "model.layers.36.self_attn.conv_v": "model-00006-of-00006.safetensors",
+    "model.layers.36.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
+    "model.layers.37.self_attn.conv_k": "model-00006-of-00006.safetensors",
+    "model.layers.37.self_attn.conv_v": "model-00006-of-00006.safetensors",
+    "model.layers.37.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
+    "model.layers.38.self_attn.conv_k": "model-00006-of-00006.safetensors",
+    "model.layers.38.self_attn.conv_v": "model-00006-of-00006.safetensors",
+    "model.layers.38.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
+    "model.layers.39.self_attn.conv_k": "model-00006-of-00006.safetensors",
+    "model.layers.39.self_attn.conv_v": "model-00006-of-00006.safetensors",
+    "model.layers.39.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.4.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.4.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.5.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
+    "model.layers.5.self_attn.conv_k": "model-00001-of-00006.safetensors",
+    "model.layers.5.self_attn.conv_v": "model-00001-of-00006.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.6.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.6.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.7.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.7.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.8.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.8.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
+    "model.layers.9.self_attn.conv_k": "model-00002-of-00006.safetensors",
+    "model.layers.9.self_attn.conv_v": "model-00002-of-00006.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "model.norm.weight": "model-00006-of-00006.safetensors"
+  }
+}
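
Reviewer note on the index above: most decoder layers sit in a single shard, but a few straddle a shard boundary. Layer 20, for example, has `mlp.gate_proj` in `model-00003-of-00006.safetensors` and `mlp.up_proj` in `model-00004-of-00006.safetensors`. The sketch below is one way to list such layers from a `weight_map`. The helper names `shards_by_layer` and `split_layers` are hypothetical and not part of this repository, and the sample dictionary holds only a handful of entries copied from the index.

```python
import re
from collections import defaultdict


def shards_by_layer(weight_map):
    """Group shard file names by decoder layer index (hypothetical helper)."""
    layer_re = re.compile(r"^model\.layers\.(\d+)\.")
    shards = defaultdict(set)
    for tensor_name, shard_file in weight_map.items():
        match = layer_re.match(tensor_name)
        if match:  # skip non-layer tensors such as model.norm.weight
            shards[int(match.group(1))].add(shard_file)
    return {layer: sorted(files) for layer, files in sorted(shards.items())}


def split_layers(weight_map):
    """Keep only the layers whose tensors are stored in more than one shard."""
    grouped = shards_by_layer(weight_map)
    return {layer: files for layer, files in grouped.items() if len(files) > 1}


# A few entries copied from the index above; a real run would load the
# full "weight_map" object from the JSON index file instead.
sample_weight_map = {
    "model.layers.19.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
    "model.layers.20.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
    "model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
    "model.norm.weight": "model-00006-of-00006.safetensors",
}

print(split_layers(sample_weight_map))
```

With the four sample entries, only layer 20 is reported, because it is the only layer whose tensors appear in two different shard files.
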
diff --git a/modeling_baichuan.py b/modeling_baichuan.py
new file mode 100644
index 0000000..6814764
--- /dev/null
+++ b/modeling_baichuan.py
@@ -0,0 +1,1197 @@
+import math
+import os
+from typing import List, Optional, Tuple, Union, Dict, Any
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from einops import rearrange
+from torch import nn
+from torch.nn import CrossEntropyLoss
+from transformers import add_start_docstrings, PreTrainedModel, DynamicCache, \
+    GenerationMixin, StaticCache, GenerationConfig
+from transformers.activations import ACT2FN
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_flash_attention_utils import _flash_supports_window_size, \
+    _upad_input
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from transformers.utils import is_flash_attn_2_available, is_flash_attn_greater_or_equal_2_10, \
+    add_start_docstrings_to_model_forward, is_torchdynamo_compiling, logging, \
+    is_flash_attn_greater_or_equal
+
+if is_flash_attn_2_available():
+    from flash_attn.bert_padding import pad_input
+    from flash_attn import flash_attn_func, flash_attn_varlen_func
+    from flash_attn.layers.rotary import apply_rotary_emb_func
+from .configuration_baichuan import BaichuanM1Config
+
+logger = logging.get_logger(__name__)
+
+
+class CustomCache(DynamicCache):
+    def __init__(self):
+        super().__init__()
+        self.past_len = []
+
+    def get_past_len(self, layer_idx: Optional[int] = 0) -> int:
+        if len(self.past_len) <= layer_idx:
+            return 0
+        return self.past_len[layer_idx]
+
+    def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
+        """Returns the sequence length of the cached states. A layer index can be optionally passed."""
+        # TODO: deprecate this function in favor of `cache_position`
+        if len(self.key_cache) <= layer_idx:
+            return 0
+        return self.key_cache[layer_idx].shape[1]
+
+    def update(
+            self,
+            key_states: torch.Tensor,
+            value_states: torch.Tensor,
+            layer_idx: int,
+            cache_kwargs: Optional[Dict[str, Any]] = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.
+
+        Parameters:
+            key_states (`torch.Tensor`):
+                The new key states to cache.
+            value_states (`torch.Tensor`):
+                The new value states to cache.
+            layer_idx (`int`):
+                The index of the layer to cache the states for.
+            cache_kwargs (`Dict[str, Any]`, `optional`):
+                Additional arguments for the cache subclass. No additional arguments are used in `CustomCache`.
+
+        Return:
+            A tuple containing the updated key and value states.
+        """
+        # Update the number of seen tokens
+        if layer_idx == 0:
+            self._seen_tokens += key_states.shape[1]
+
+        # Update the cache
+        if len(self.key_cache) <= layer_idx:
+            self.key_cache.append(key_states)
+            self.value_cache.append(value_states)
+        else:
+            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=1)
+            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=1)
+
+        if len(self.past_len) <= layer_idx:
+            self.past_len.append(key_states.shape[1])
+        else:
+            self.past_len[layer_idx] += key_states.shape[1]
+
+        return self.key_cache[layer_idx], self.value_cache[layer_idx]
+
+
+def _prepare_4d_causal_attention_mask_with_cache_position(
+        attention_mask: torch.Tensor,
+        sequence_length: int,
+        target_length: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        min_dtype: float,
+        cache_position: torch.Tensor,
+        batch_size: int,
+):
+    """
+    Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+    `(batch_size, key_value_length)`. If the input `attention_mask` is already 4D, it is returned unchanged.
+
+    Args:
+        attention_mask (`torch.Tensor`):
+            A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
+        sequence_length (`int`):
+            The sequence length being processed.
+        target_length (`int`):
+            The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
+        dtype (`torch.dtype`):
+            The dtype to use for the 4D attention mask.
+        device (`torch.device`):
+            The device to place the 4D attention mask on.
+        min_dtype (`float`):
+            The minimum value representable with the dtype `dtype`.
+        cache_position (`torch.Tensor`):
+            Indices depicting the position of the input sequence tokens in the sequence.
+        batch_size (`int`):
+            Batch size.
+    """
+    if attention_mask is not None and attention_mask.dim() == 4:
+        # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+        causal_mask = attention_mask
+    else:
+        causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
+        if sequence_length != 1:
+            causal_mask = torch.triu(causal_mask, diagonal=1)
+        causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+        causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+        if attention_mask is not None:
+            causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+            mask_length = attention_mask.shape[-1]
+            padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
+            padding_mask = padding_mask == 0
+            causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                padding_mask, min_dtype
+            )
+
+    return causal_mask
+
+
+class BaichuanRMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+        # convert into half-precision if necessary
+        if self.weight.dtype in [torch.float16, torch.bfloat16]:
+            hidden_states = hidden_states.to(self.weight.dtype)
+
+        return self.weight * hidden_states
+
+
+class RotaryEmbedding(torch.nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=1e5, device=None, interleaved=False):
+        super().__init__()
+        self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
+        self.base = base
+        self.dim = dim
+        # The cos/sin tables are built lazily in `forward`; starting at 0 forces a build on the first call.
+        self.max_seq_len_cached = 0
+        self.interleaved = interleaved
+
+    def forward(self, q, k, seqlen_offset=0, cu_seqlens=None, max_seqlen=None):
+        # q, k: [bs, seq_len, num_heads, head_size] (the sequence dimension is dim 1, see `seq_len_dim`)
+        # The `if` block below rebuilds the cos/sin tables whenever the requested length exceeds the cached one.
+        seq_len_dim = 1
+        seq_len = q.shape[seq_len_dim] + seqlen_offset
+        if seq_len > self.max_seq_len_cached:
+            self.max_seq_len_cached = seq_len
+            self.inv_freq = 1.0 / (
+                    self.base ** (torch.arange(0, self.dim, 2).float().to(self.inv_freq.device) / self.dim))
+            t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
+            # freqs = torch.einsum("i,j->ij", t, self.inv_freq)  # don't use this, bug in fp16
+            freqs = torch.outer(t, self.inv_freq)
+            self.cos_cached = freqs.cos().to(q.device)
+            self.sin_cached = freqs.sin().to(k.device)
+        q_ori_size = q.size()
+        k_ori_size = k.size()
+        if cu_seqlens is not None:
+            q = flatten_one_dim(q)
+            k = flatten_one_dim(k)
+        q_new = apply_rotary_emb_func(
+            q.float(), self.cos_cached[seqlen_offset:], self.sin_cached[seqlen_offset:],
+            self.interleaved, True,  # inplace=True
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen
+        ).to(q.dtype)
+        k_new = apply_rotary_emb_func(
+            k.float(), self.cos_cached[seqlen_offset:], self.sin_cached[seqlen_offset:],
+            self.interleaved, True,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen
+        ).to(k.dtype)
+        if cu_seqlens is not None:
+            q_new = q_new.reshape(*q_ori_size)
+            k_new = k_new.reshape(*k_ori_size)
+        return q_new, k_new
+
+
+class BaichuanMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, hidden_state):
+        return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
+
+
+class BaichuanAttention(nn.Module):
+    """
+    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
+    and "Generating Long Sequences with Sparse Transformers".
+    """
+
+    def __init__(self, config: BaichuanM1Config, layer_idx: Optional[int] = None):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        if layer_idx is None:
+            raise ValueError(
+                f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not supported and will "
+                "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
+                "when creating this class."
+            )
+
+        self.hidden_size = config.hidden_size
+        self.is_swa = layer_idx in self.config.sliding_window_layers
+        self.num_heads = config.num_swa_attention_heads if self.is_swa else config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_swa_key_value_heads if self.is_swa else config.num_key_value_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        self.is_causal = True
+        self.attention_dropout = config.attention_dropout
+
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.W_pack = nn.Linear(config.hidden_size, self.hidden_size + 2 * self.num_key_value_heads * self.head_dim,
+                                bias=False)
+        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+
+        self.rotary_emb = RotaryEmbedding(dim=self.head_dim, base=self.config.rope_theta,
+                                          max_position_embeddings=self.config.max_position_embeddings)
+        self.conv_window = config.conv_window
+        assert self.conv_window == 2  # only conv_window == 2 is currently supported at inference time
+        self.conv_k = nn.Parameter(torch.softmax(torch.randn((1, 1, self.num_key_value_heads, 1, self.conv_window)), dim=-1))
+        self.conv_v = nn.Parameter(torch.softmax(torch.randn((1, 1, self.num_key_value_heads, 1, self.conv_window)), dim=-1))
+        self.last_k, self.last_v = None, None
+
+
+def get_max_seqlen(cu_seqlens):
+    max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
+    return max_seqlen
+
+
+def flatten_one_dim(tensor):
+    tensor = tensor.view(-1, tensor.size(-2), tensor.size(-1))
+    return tensor
+
+
+def prepare_for_flash_attention_varlen(query, key, value, cu_seqlens):
+    query = query.view(-1, query.size(-2), query.size(-1))
+    key = key.view(-1, key.size(-2), key.size(-1))
+    value = value.view(-1, value.size(-2), value.size(-1))
+    return query, key, value, get_max_seqlen(cu_seqlens)
+
+
+def flash_attention_forward(
+        query_states: torch.Tensor,
+        key_states: torch.Tensor,
+        value_states: torch.Tensor,
+        query_length: int,
+        is_causal: bool,
+        dropout: float = 0.0,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        seqlens: Optional[torch.LongTensor] = None,
+        softmax_scale: Optional[float] = None,
+        sliding_window: Optional[int] = None,
+        use_top_left_mask: bool = False,
+        softcap: Optional[float] = None,
+        deterministic: Optional[bool] = None,
+):
+    """
+    Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
+    the input is first unpadded, the attention output is computed, and the result is padded back.
+
+    Args:
+        query_states (`torch.Tensor`):
+            Input query states to be passed to Flash Attention API
+        key_states (`torch.Tensor`):
+            Input key states to be passed to Flash Attention API
+        value_states (`torch.Tensor`):
+            Input value states to be passed to Flash Attention API
+        attention_mask (`torch.Tensor`):
+            The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+            position of padding tokens and 1 for the position of non-padding tokens.
+        dropout (`float`):
+            Attention dropout
+        softmax_scale (`float`, *optional*):
+            The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
+        use_top_left_mask (`bool`, defaults to `False`):
+            flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default in flash_attn>=2.1. This attribute is used to handle this difference.
+        softcap (`float`, *optional*):
+            Softcap for the attention logits, used e.g. in gemma2.
+        deterministic (`bool`, *optional*):
+            Determines if the deterministic option introduced in flash_attn>=2.4.1 is enabled.
+    """
+    if not use_top_left_mask:
+        causal = is_causal
+    else:
+        # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1.
+        causal = is_causal and query_length != 1
+
+    # Assuming 4D tensors, key_states.shape[1] is the key/value sequence length (source length).
+    use_sliding_windows = (
+            _flash_supports_window_size and sliding_window is not None and key_states.shape[1] > sliding_window
+    )
+    flash_kwargs = {"window_size": (sliding_window - 1, 0)} if use_sliding_windows else {}
+
+    if is_flash_attn_greater_or_equal("2.4.1"):
+        if deterministic is None:
+            deterministic = os.environ.get("FLASH_ATTENTION_DETERMINISTIC", "0") == "1"
+        flash_kwargs["deterministic"] = deterministic
+
+    if softcap is not None:
+        flash_kwargs["softcap"] = softcap
+    # Packed sequences (cumulative lengths in `seqlens`) take the varlen path directly; padded batches are unpadded first.
+    if seqlens is not None:
+        batch_size = query_states.shape[0]
+        query_states, key_states, value_states, max_seqlen = prepare_for_flash_attention_varlen(query_states,
+                                                                                                key_states,
+                                                                                                value_states, seqlens)
+        attn_output = flash_attn_varlen_func(
+            query_states,
+            key_states,
+            value_states,
+            cu_seqlens_q=seqlens,
+            cu_seqlens_k=seqlens,
+            max_seqlen_q=max_seqlen,
+            max_seqlen_k=max_seqlen,
+            dropout_p=dropout,
+            softmax_scale=softmax_scale,
+            causal=causal,
+            **flash_kwargs,
+        )
+
+        attn_output = attn_output.reshape(batch_size, -1, attn_output.size(-2), attn_output.size(-1))
+
+    elif attention_mask is not None:
+        batch_size = query_states.shape[0]
+        query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = _upad_input(
+            query_states, key_states, value_states, attention_mask, query_length
+        )
+        cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+        max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+        attn_output_unpad = flash_attn_varlen_func(
+            query_states,
+            key_states,
+            value_states,
+            cu_seqlens_q=cu_seqlens_q,
+            cu_seqlens_k=cu_seqlens_k,
+            max_seqlen_q=max_seqlen_in_batch_q,
+            max_seqlen_k=max_seqlen_in_batch_k,
+            dropout_p=dropout,
+            softmax_scale=softmax_scale,
+            causal=causal,
+            **flash_kwargs,
+        )
+        attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+
+    else:
+        attn_output = flash_attn_func(
+            query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal, **flash_kwargs
+        )
+
+    return attn_output
+
+
+def custom_convolution(U, K):
+    """
+    Causal per-head convolution along the sequence dimension.
+
+    U: Input tensor, shape (bs, seq, h, d)
+    K: Convolution kernel, shape (1, 1, h, 1, w), broadcast over batch, sequence and head_dim
+    Returns: Output tensor V, shape (bs, seq, h, d)
+    """
+    w = K.size(-1)
+    padding = (w - 1, 0)
+    U_padded = F.pad(U, (0, 0, 0, 0, *padding))  # Left-pad the sequence dim: (bs, seq+w-1, h, d)
+    U_unfolded = U_padded.unfold(1, w, 1)  # Sliding windows of size w: (bs, seq, h, d, w)
+    V_unfolded = U_unfolded * K  # Per-head weighting of each window: (bs, seq, h, d, w)
+    V = V_unfolded.sum(dim=-1)  # Sum over the window: (bs, seq, h, d)
+    return V
+
+
+def custom_convolution_with_splits(U, K, cu_seqlens):
+    """
+    U: Input tensor, shape (bs, seq, h, d)
+    K: Convolution kernel, shape (1, 1, h, 1, w)
+    cu_seqlens: Cumulative sequence lengths, indicating how to split the input.
+    Returns: Output tensor, shape (bs, seq, h, d)
+    """
+    ori_shape = U.size()  # Save the original shape of U
+    # Flatten U to handle variable-length sequences
+    U_flatten = U.reshape(1, -1, ori_shape[-2], ori_shape[-1])  # Shape: (1, total_seq, h, d)
+
+    # Perform convolution on each subsequence separately
+    V_parts = []  # Store the results of each subsequence
+    start = 0  # Start index of the current subsequence
+    for end in cu_seqlens[1:]:
+        end = end.item()  # Convert scalar tensor to int
+        U_part = U_flatten[:, start:end, :, :]  # Slice the subsequence (1, seq_sub, h, d)
+        V_part = custom_convolution(U_part, K)  # Apply custom convolution
+        V_parts.append(V_part)  # Append the result
+        start = end  # Update the start index for the next subsequence
+
+    # Concatenate the results along the sequence dimension
+    V = torch.cat(V_parts, dim=1).to(U)  # Shape: (1, total_seq, h, d)
+
+    # Reshape the output to match the original input shape
+    return V.reshape(ori_shape)
+
+
+class BaichuanFlashAttention2(BaichuanAttention):
+    """
+    Baichuan flash attention module, following Baichuan attention module. This module inherits from `BaichuanAttention`
+    as the weights of the module stay untouched. The only required change is in the forward pass,
+    which needs to correctly call the public API of flash attention and deal with padding tokens
+    in case the input contains any of them.
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default in flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+    def forward(
+            self,
+            hidden_states: torch.Tensor,
+            attention_mask: Optional[torch.Tensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            seqlens: Optional[torch.LongTensor] = None,
+            past_key_value: Optional[CustomCache] = None,
+            output_attentions: bool = False,
+            use_cache: bool = False,
+            cache_position: Optional[torch.LongTensor] = None,
+            position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
+    ):
+
+        bsz, q_len, _ = hidden_states.size()
+        proj = self.W_pack(hidden_states)
+        proj = rearrange(proj, 'bs seq_len (n_head head_dim) -> n_head bs seq_len head_dim', head_dim=self.head_dim)
+        query_states = rearrange(proj[:self.num_heads], 'n_head bs seq_len head_dim -> bs seq_len n_head head_dim')
+        key_states = rearrange(proj[self.num_heads:self.num_heads + self.num_key_value_heads],
+                               'n_head bs seq_len head_dim -> bs seq_len n_head head_dim')
+        value_states = rearrange(proj[self.num_heads + self.num_key_value_heads:],
+                                 'n_head bs seq_len head_dim -> bs seq_len n_head head_dim')
+
+
+        if past_key_value is None or past_key_value.get_seq_length(self.layer_idx) == 0:  # prefill
+            if not self.training:
+                self.last_k = key_states[:, -1:]
+                self.last_v = value_states[:, -1:]
+            if seqlens is None:
+                key_states = custom_convolution(key_states, self.conv_k)
+                value_states = custom_convolution(value_states, self.conv_v)
+            else:
+                assert seqlens.ndim == 1
+                key_states = custom_convolution_with_splits(key_states, self.conv_k, seqlens)
+                value_states = custom_convolution_with_splits(value_states, self.conv_v, seqlens)
+        else:  # decode
+            self.last_k, key_states = key_states, self.conv_k[0, 0, :, 0, :1] * self.last_k + self.conv_k[0, 0, :, 0, 1:] * key_states
+            self.last_v, value_states = value_states, self.conv_v[0, 0, :, 0, :1] * self.last_v + self.conv_v[0, 0, :, 0, 1:] * value_states
+        if seqlens is not None:
+            max_seqlen = get_max_seqlen(seqlens)
+        else:
+            max_seqlen = None
+
+        past_len = past_key_value.get_past_len(self.layer_idx) if past_key_value is not None else 0
+        query_states, key_states = self.rotary_emb(
+            query_states,
+            key_states,
+            seqlen_offset=past_len,
+            cu_seqlens=seqlens,
+            max_seqlen=max_seqlen
+        )
+
+        if past_key_value is not None:
+            cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
+            kv_seq_len = key_states.shape[1] + past_key_value.get_seq_length(self.layer_idx)
+            if (
+                    self.is_swa
+                    and kv_seq_len > self.config.sliding_window
+                    and cache_has_contents
+            ):
+                slicing_tokens = 1 - self.config.sliding_window
+                past_key = past_key_value[self.layer_idx][0]
+                past_value = past_key_value[self.layer_idx][1]
+
+                past_key_value.key_cache[self.layer_idx] = past_key[:, slicing_tokens:, :, :].contiguous()
+                past_key_value.value_cache[self.layer_idx] = past_value[:, slicing_tokens:, :, :].contiguous()
+
+                if past_key_value[self.layer_idx][0].shape[1] != self.config.sliding_window - 1:
+                    raise ValueError(
+                        "past key must have a shape of (`batch_size, self.config.sliding_window-1, num_heads, head_dim`), got"
+                        f" {past_key_value[self.layer_idx][0].shape}"
+                    )
+
+                # if attention_mask is not None:
+                #     # TODO: not check!!
+                #     attention_mask = attention_mask[:, slicing_tokens:]
+                #     attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx)
+
+        input_dtype = query_states.dtype
+        if input_dtype == torch.float32:
+            if torch.is_autocast_enabled():
+                target_dtype = torch.get_autocast_gpu_dtype()
+            # Handle the case where the model is quantized
+            elif hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.W_pack.weight.dtype
+
+            logger.warning_once(
+                "The input hidden states seem to have been silently cast to float32; this might be because"
+                " embedding or layer norm layers were upcast to float32. The input will be cast back to"
+                f" {target_dtype}."
+            )
+
+            query_states = query_states.to(target_dtype)
+            key_states = key_states.to(target_dtype)
+            value_states = value_states.to(target_dtype)
+
+        if self.is_swa:
+            sliding_window = self.config.sliding_window
+        else:
+            sliding_window = None
+        attn_output = flash_attention_forward(
+            query_states,
+            key_states,
+            value_states,
+            query_length=q_len,
+            position_ids=position_ids,
+            seqlens=seqlens,
+            sliding_window=sliding_window,
+            is_causal=self.is_causal,
+            use_top_left_mask=self._flash_attn_uses_top_left_mask,
+        )
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+        attn_output = self.o_proj(attn_output)
+
+        # Flash Attention does not materialize attention weights, so none can be returned.
+        attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+Baichuan_ATTENTION_CLASSES = {
+    "eager": BaichuanAttention,
+    "flash_attention_2": BaichuanFlashAttention2,
+}
+
+
+class BaichuanDecoderLayer(nn.Module):
+    def __init__(self, config: BaichuanM1Config, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.layer_idx = layer_idx
+        self.self_attn = Baichuan_ATTENTION_CLASSES['flash_attention_2'](config, layer_idx)
+
+        self.mlp = BaichuanMLP(config)
+        self.input_layernorm = BaichuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = BaichuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(
+            self,
+            hidden_states: torch.Tensor,
+            attention_mask: Optional[torch.Tensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            seqlens: Optional[torch.LongTensor] = None,
+            past_key_value: Optional[Tuple[torch.Tensor]] = None,
+            output_attentions: Optional[bool] = False,
+            use_cache: Optional[bool] = False,
+            cache_position: Optional[torch.LongTensor] = None,
+            position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  
+            **kwargs,
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+                `(batch, sequence_length)` where padding elements are indicated by 0.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+                Indices depicting the position of the input sequence tokens in the sequence.
+            position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
+                Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
+                with `head_dim` being the embedding dimension of each attention head.
+            kwargs (`dict`, *optional*):
+                Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+                into the model
+        """
+
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            seqlens=seqlens,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        if use_cache:
+            outputs += (present_key_value,)
+        return outputs
+
+
+Baichuan_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`BaichuanM1Config`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+    "The bare Baichuan Model outputting raw hidden-states without any specific head on top.",
+    Baichuan_START_DOCSTRING,
+)
+class BaichuanPreTrainedModel(PreTrainedModel):
+    config_class = BaichuanM1Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["BaichuanDecoderLayer"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_cache_class = True
+    _supports_quantized_cache = True
+    _supports_static_cache = True
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+
+Baichuan_INPUTS_DOCSTRING = r"""
+
+"""
+
+
+@add_start_docstrings(
+    "The bare Baichuan Model outputting raw hidden-states without any specific head on top.",
+    Baichuan_START_DOCSTRING,
+)
+class BaichuanModel(BaichuanPreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`BaichuanDecoderLayer`]
+
+    Args:
+        config: BaichuanM1Config
+    """
+
+    def __init__(self, config: BaichuanM1Config):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [BaichuanDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self._attn_implementation = config._attn_implementation
+        self.norm = BaichuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+        self.gradient_checkpointing = True
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(Baichuan_INPUTS_DOCSTRING)
+    def forward(
+            self,
+            input_ids: torch.LongTensor = None,
+            attention_mask: Optional[torch.Tensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            seqlens: Optional[torch.LongTensor] = None,
+            past_key_values: Optional[CustomCache] = None,
+            inputs_embeds: Optional[torch.FloatTensor] = None,
+            use_cache: Optional[bool] = None,
+            output_attentions: Optional[bool] = None,
+            output_hidden_states: Optional[bool] = None,
+            return_dict: Optional[bool] = None,
+            cache_position: Optional[torch.LongTensor] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError(
+                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+            )
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        if seqlens is not None:
+            assert seqlens.ndim == 2
+            # Flatten the multi-packed samples of the batch into one list of cumulative sequence lengths
+            cu_seqlens = []
+            offset, seqlen = 0, seqlens.size(1)
+            for lens in seqlens:
+                cu_seqlens.append(offset)
+                cu_seqlens.extend((lens[(lens > 0) & (lens < seqlen)] + offset).tolist())
+                offset += seqlen
+            cu_seqlens.append(offset)
+            seqlens = torch.tensor(cu_seqlens, dtype=torch.int32, device=input_ids.device)
+            # unset attention_mask to save memory
+            attention_mask = None
+        # kept for BC (non `Cache` `past_key_values` inputs)
+        return_legacy_cache = False
+        if use_cache and not isinstance(past_key_values, CustomCache):
+            return_legacy_cache = False
+            if past_key_values is None:
+                past_key_values = CustomCache()
+            else:
+                past_key_values = CustomCache.from_legacy_cache(past_key_values)
+                logger.warning_once(
+                    "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
+                    "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
+                    "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
+                )
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        causal_mask = self._update_causal_mask(
+            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+        )
+
+        hidden_states = inputs_embeds
+
+        # create position embeddings to be shared across the decoder layers
+        # position_embeddings = self.rotary_emb(hidden_states, position_ids)
+        position_embeddings = None
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = None
+
+        for decoder_layer in self.layers:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = torch.utils.checkpoint.checkpoint(
+                    decoder_layer,
+                    hidden_states,
+                    causal_mask,
+                    position_ids,
+                    seqlens,
+                    past_key_values,
+                    output_attentions,
+                    use_cache,
+                    cache_position,
+                    position_embeddings,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=causal_mask,
+                    position_ids=position_ids,
+                    seqlens=seqlens,
+                    past_key_value=past_key_values,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                    cache_position=cache_position,
+                    position_embeddings=position_embeddings,
+                )
+
+            hidden_states = layer_outputs[0]
+            if use_cache:
+                next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = next_decoder_cache if use_cache else None
+        if return_legacy_cache:
+            next_cache = next_cache.to_legacy_cache()
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+
+    def _update_causal_mask(
+            self,
+            attention_mask: torch.Tensor,
+            input_tensor: torch.Tensor,
+            cache_position: torch.Tensor,
+            past_key_values: CustomCache,
+            output_attentions: bool,
+    ):
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and 0.0 in attention_mask:
+                return attention_mask
+            return None
+
+        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+        # to infer the attention mask.
+        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+        using_static_cache = isinstance(past_key_values, StaticCache)
+
+        # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+        if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
+            if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                    attention_mask,
+                    inputs_embeds=input_tensor,
+                    past_key_values_length=past_seen_tokens,
+                    is_training=self.training,
+            ):
+                return None
+
+        dtype, device = input_tensor.dtype, input_tensor.device
+        min_dtype = torch.finfo(dtype).min
+        sequence_length = input_tensor.shape[1]
+        if using_static_cache:
+            target_length = past_key_values.get_max_length()
+        else:
+            target_length = (
+                attention_mask.shape[-1]
+                if isinstance(attention_mask, torch.Tensor)
+                else past_seen_tokens + sequence_length + 1
+            )
+
+        # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
+        causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
+            attention_mask,
+            sequence_length=sequence_length,
+            target_length=target_length,
+            dtype=dtype,
+            device=device,
+            min_dtype=min_dtype,
+            cache_position=cache_position,
+            batch_size=input_tensor.shape[0],
+        )
+
+        if (
+                self.config._attn_implementation == "sdpa"
+                and attention_mask is not None
+                and attention_mask.device.type == "cuda"
+                and not output_attentions
+        ):
+            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+            # Details: https://github.com/pytorch/pytorch/issues/110213
+            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+        return causal_mask
+
+
+class NormHead(nn.Module):
+    def __init__(self, hidden_size, vocab_size, bias=False):
+        super().__init__()
+        self.weight = nn.Parameter(torch.empty((vocab_size, hidden_size)))
+        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
+
+    def forward(self, hidden_states):
+        norm_weight = nn.functional.normalize(self.weight)
+        return nn.functional.linear(hidden_states, norm_weight)
+
+
+class BaichuanM1ForCausalLM(BaichuanPreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = BaichuanModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    @add_start_docstrings_to_model_forward(Baichuan_INPUTS_DOCSTRING)
+    def forward(
+            self,
+            input_ids: torch.LongTensor = None,
+            attention_mask: Optional[torch.Tensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            seqlens: Optional[torch.LongTensor] = None,
+            past_key_values: Optional[List[torch.FloatTensor]] = None,
+            inputs_embeds: Optional[torch.FloatTensor] = None,
+            labels: Optional[torch.LongTensor] = None,
+            use_cache: Optional[bool] = None,
+            output_attentions: Optional[bool] = None,
+            output_hidden_states: Optional[bool] = None,
+            return_dict: Optional[bool] = None,
+            cache_position: Optional[torch.LongTensor] = None,
+            num_logits_to_keep: int = 0,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+            num_logits_to_keep (`int`, *optional*):
+                Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
+                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, AutoModelForCausalLM
+
+        >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS, trust_remote_code=True)
+        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER, trust_remote_code=True)
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if input_ids is not None:
+            input_ids[input_ids == self.config.vocab_size] = 0
+        if labels is not None:
+            labels[labels == self.config.vocab_size] = 0
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            seqlens=seqlens,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position,
+        )
+
+        hidden_states = outputs[0]
+        if labels is None and not is_torchdynamo_compiling():
+            logger.warning_once(
+                "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
+            )
+        # Only compute the necessary logits. Note that no float() upcast is applied here,
+        # so the logits keep the dtype of the model.
+        logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
+
+        loss = None
+        if labels is not None:
+            # Upcast to float if we need to compute the loss to avoid potential precision issues
+            # logits = logits.float()
+            # Shift so that tokens < n predict n
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            #shift_logits = logits
+            #shift_labels = labels
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            shift_logits = shift_logits.view(-1, self.config.vocab_size)
+            shift_labels = shift_labels.view(-1)
+            # Enable model parallelism
+            shift_labels = shift_labels.to(shift_logits.device)
+            loss = loss_fct(shift_logits, shift_labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+    def prepare_inputs_for_generation(
+            self,
+            input_ids,
+            past_key_values=None,
+            attention_mask=None,
+            inputs_embeds=None,
+            cache_position=None,
+            position_ids=None,
+            use_cache=True,
+            num_logits_to_keep=None,
+            **kwargs,
+    ):
+        # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
+        # Exception 1: when passing input_embeds, input_ids may be missing entries
+        # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
+        if past_key_values is not None:
+            if inputs_embeds is not None:  # Exception 1
+                input_ids = input_ids[:, -cache_position.shape[0]:]
+            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
+                input_ids = input_ids[:, cache_position]
+
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1]:]
+
+                # This `clone` call is needed to avoid recapturing CUDA graphs with `torch.compile`'s `mode="reduce-overhead"`, as otherwise the input `position_ids` would have varying strides during decoding. Simply using `.contiguous()` is not sufficient here: in the batch size = 1 case, `position_ids` is already contiguous but with a varying stride, which retriggers a capture.
+                position_ids = position_ids.clone(memory_format=torch.contiguous_format)
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and cache_position[0] == 0:
+            model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
+        else:
+            # The clone here is for the same reason as for `position_ids`.
+            model_inputs = {"input_ids": input_ids.clone(memory_format=torch.contiguous_format), "inputs_embeds": None}
+
+        if isinstance(past_key_values, StaticCache) and attention_mask.ndim == 2:
+            if model_inputs["inputs_embeds"] is not None:
+                batch_size, sequence_length, _ = model_inputs["inputs_embeds"].shape
+                device = model_inputs["inputs_embeds"].device
+            else:
+                batch_size, sequence_length = model_inputs["input_ids"].shape
+                device = model_inputs["input_ids"].device
+
+            dtype = self.lm_head.weight.dtype
+            min_dtype = torch.finfo(dtype).min
+
+            attention_mask = _prepare_4d_causal_attention_mask_with_cache_position(
+                attention_mask,
+                sequence_length=sequence_length,
+                target_length=past_key_values.get_max_length(),
+                dtype=dtype,
+                device=device,
+                min_dtype=min_dtype,
+                cache_position=cache_position,
+                batch_size=batch_size,
+            )
+
+        if num_logits_to_keep is not None:
+            model_inputs["num_logits_to_keep"] = num_logits_to_keep
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "cache_position": cache_position,
+                "past_key_values": past_key_values,
+                "use_cache": use_cache,
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
+
+    @torch.no_grad()
+    def chat(self, tokenizer, messages: List[dict], stream=False,
+             generation_config: Optional[GenerationConfig] = None):
+        generation_config = generation_config or self.generation_config
+        input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
+        input_ids = torch.LongTensor([input_ids]).to(self.device)
+        if stream:
+            streamer = TextIterStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+            Thread(target=self.generate, kwargs=dict(
+                inputs=input_ids, streamer=streamer,
+                generation_config=generation_config,
+            )).start()
+            return streamer
+        else:
+            outputs = self.generate(input_ids, generation_config=generation_config)
+            response = tokenizer.decode(outputs[0][len(input_ids[0]):], skip_special_tokens=True)
+            return response
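The `NormHead` class above normalizes each row of the output weight matrix before the final projection, so a logit depends only on the direction of that row, not on its magnitude. A minimal dependency-free sketch of that computation follows. It uses plain Python lists instead of tensors, and the names `l2_normalize_rows` and `norm_head_logits`, together with the toy weights, are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize_rows(weight, eps=1e-12):
    """Scale every row to unit L2 norm, guarding against division by zero."""
    normalized = []
    for row in weight:
        norm = max(math.sqrt(sum(x * x for x in row)), eps)
        normalized.append([x / norm for x in row])
    return normalized


def norm_head_logits(hidden, weight):
    """Return one logit per vocabulary row: dot(hidden, row / ||row||)."""
    return [
        sum(h * w for h, w in zip(hidden, row))
        for row in l2_normalize_rows(weight)
    ]


# Hypothetical toy head: 3 vocabulary entries, hidden size 2.
weight = [[3.0, 4.0], [0.0, 10.0], [0.5, 0.0]]
hidden = [1.0, 0.0]
print(norm_head_logits(hidden, weight))
```

Rescaling any weight row by a positive constant leaves the logits unchanged in this sketch, which is the property the normalization is meant to provide.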
diff --git a/special_tokens_map.json b/special_tokens_map.json
new file mode 100644
index 0000000..4248372
--- /dev/null
+++ b/special_tokens_map.json
@@ -0,0 +1,46 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<B_SYS>",
+    "<B_USYS>",
+    "<C_Q>",
+    "<C_A>",
+    "<|im_sep|>",
+    "<|tool_call|>",
+    "<|arguments|>"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
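The role markers listed above (`<B_SYS>`, `<B_USYS>`, `<C_Q>`, `<C_A>`) are the prefixes that the `chat_template` in `tokenizer_config.json` (part of this same diff) puts in front of each message. The sketch below mirrors that template in plain Python to show the resulting prompt string. The function name `render_chat` and the use of `ValueError` are assumptions made for illustration; the real rendering is done by the tokenizer's Jinja template.

```python
# Role-to-prefix mapping, copied from the chat_template in tokenizer_config.json.
ROLE_PREFIX = {
    "system": "<B_SYS>",
    "user_system": "<B_USYS>",
    "user": "<C_Q>",
    "assistant": "<C_A>",
    "function": "<B_FUNC>",
    "code": "<B_CODE>",
}


def render_chat(messages, add_generation_prompt=True):
    """Concatenate prefix + content per message, then open an assistant turn."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_PREFIX:
            # The template raises on unknown roles; ValueError stands in here.
            raise ValueError("Invalid message role: " + role)
        parts.append(ROLE_PREFIX[role] + message["content"])
    if add_generation_prompt:
        parts.append("<C_A>")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What causes anemia?"},
]
print(render_chat(messages))
```

Note that the template adds no separator or end-of-turn token between messages, so the markers alone delimit the turns.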
diff --git a/tokenization_baichuan.py b/tokenization_baichuan.py
new file mode 100644
index 0000000..2a6db7d
--- /dev/null
+++ b/tokenization_baichuan.py
@@ -0,0 +1,231 @@
+import os
+from shutil import copyfile
+from typing import Any, Dict, List, Optional, Tuple
+
+import sentencepiece as spm
+
+from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
+from transformers.utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {},
+    "tokenizer_file": {},
+}
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
+
+
+class BaichuanTokenizer(PreTrainedTokenizer):
+    """
+    Construct a Baichuan tokenizer, backed by a SentencePiece model.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(
+        self,
+        vocab_file,
+        unk_token="<unk>",
+        bos_token="<s>",
+        eos_token="</s>",
+        pad_token=None,
+        sp_model_kwargs: Optional[Dict[str, Any]] = None,
+        add_bos_token=True,
+        add_eos_token=False,
+        clean_up_tokenization_spaces=False,
+        **kwargs,
+    ):
+        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+        self.vocab_file = vocab_file
+        self.add_bos_token = add_bos_token
+        self.add_eos_token = add_eos_token
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            add_bos_token=add_bos_token,
+            add_eos_token=add_eos_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs,
+        )
+        self.pad_token_id = self._convert_token_to_id(self.pad_token)
+
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["sp_model"] = None
+        return state
+
+    def __setstate__(self, d):
+        self.__dict__ = d
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(self.vocab_file)
+
+    @property
+    def vocab_size(self):
+        """Returns vocab size"""
+        return self.sp_model.get_piece_size()
+
+    def get_vocab(self):
+        """Returns vocab as a dict"""
+        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+
+    def _tokenize(self, text):
+        """Returns a tokenized string."""
+        return self.sp_model.encode(text, out_type=str)
+
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) to an id using the vocab."""
+        return self.sp_model.piece_to_id(token)
+
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) to a token (str) using the vocab."""
+        token = self.sp_model.IdToPiece(index)
+        return token
+
+    def convert_tokens_to_string(self, tokens):
+        """Converts a sequence of tokens (strings) into a single string."""
+        current_sub_tokens = []
+        out_string = ""
+        prev_is_special = False
+        for i, token in enumerate(tokens):
+            # make sure that special tokens are not decoded using sentencepiece model
+            if token in self.all_special_tokens:
+                if not prev_is_special and i != 0:
+                    out_string += " "
+                out_string += self.sp_model.decode(current_sub_tokens) + token
+                prev_is_special = True
+                current_sub_tokens = []
+            else:
+                current_sub_tokens.append(token)
+                prev_is_special = False
+        out_string += self.sp_model.decode(current_sub_tokens)
+        return out_string
+
+    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        """
+        Save the vocabulary and special tokens file to a directory.
+
+        Args:
+            save_directory (`str`):
+                The directory in which to save the vocabulary.
+
+        Returns:
+            `Tuple(str)`: Paths to the files saved.
+        """
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        out_vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+
+        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+            copyfile(self.vocab_file, out_vocab_file)
+        elif not os.path.isfile(self.vocab_file):
+            with open(out_vocab_file, "wb") as fi:
+                content_spiece_model = self.sp_model.serialized_model_proto()
+                fi.write(content_spiece_model)
+
+        return (out_vocab_file,)
+
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+        output = bos_token_id + token_ids_0 + eos_token_id
+
+        if token_ids_1 is not None:
+            output = output + bos_token_id + token_ids_1 + eos_token_id
+
+        return output
+
+    def get_special_tokens_mask(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+    ) -> List[int]:
+        """
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer `prepare_for_model` method.
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not the token list is already formatted with special tokens for the model.
+
+        Returns:
+            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+        """
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+            )
+
+        bos_token_id = [1] if self.add_bos_token else []
+        eos_token_id = [1] if self.add_eos_token else []
+
+        if token_ids_1 is None:
+            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
+        return (
+            bos_token_id
+            + ([0] * len(token_ids_0))
+            + eos_token_id
+            + bos_token_id
+            + ([0] * len(token_ids_1))
+            + eos_token_id
+        )
+
+    def create_token_type_ids_from_sequences(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. A sequence
+        pair mask has the following format:
+
+        ```
+        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+        | first sequence    | second sequence |
+        ```
+
+        If `token_ids_1` is None, only returns the first portion of the mask (0s).
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of ids.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
+        """
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
+
+        if token_ids_1 is not None:
+            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
+
+        return output
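The three helper methods at the end of `BaichuanTokenizer` all apply the same rule: an optional BOS id is placed before each sequence and an optional EOS id after it. The sketch below reproduces that logic on plain lists so the three outputs can be compared side by side. The ids `1` and `2` follow the `<s>` and `</s>` entries in `tokenizer_config.json` from this diff; the function names and the sample token ids are hypothetical.

```python
BOS_ID, EOS_ID = 1, 2  # ids of <s> and </s> per tokenizer_config.json


def build_inputs(ids_0, ids_1=None, add_bos=True, add_eos=False):
    """Wrap each sequence with the optional BOS/EOS ids."""
    bos = [BOS_ID] if add_bos else []
    eos = [EOS_ID] if add_eos else []
    out = bos + ids_0 + eos
    if ids_1 is not None:
        out = out + bos + ids_1 + eos
    return out


def special_tokens_mask(ids_0, ids_1=None, add_bos=True, add_eos=False):
    """1 marks an added special token, 0 marks a regular sequence token."""
    bos = [1] if add_bos else []
    eos = [1] if add_eos else []
    mask = bos + [0] * len(ids_0) + eos
    if ids_1 is not None:
        mask = mask + bos + [0] * len(ids_1) + eos
    return mask


def token_type_ids(ids_0, ids_1=None, add_bos=True, add_eos=False):
    """0 for every position of the first sequence, 1 for the second."""
    extra = int(add_bos) + int(add_eos)
    types = [0] * (len(ids_0) + extra)
    if ids_1 is not None:
        types += [1] * (len(ids_1) + extra)
    return types


first, second = [10, 11], [12]  # hypothetical token ids
print(build_inputs(first, second, add_eos=True))
print(special_tokens_mask(first, second, add_eos=True))
print(token_type_ids(first, second, add_eos=True))
```

The shipped `tokenizer_config.json` sets both `add_bos_token` and `add_eos_token` to `false`, in which case all three helpers reduce to plain concatenation with no special tokens added.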
diff --git a/tokenizer.model b/tokenizer.model
new file mode 100644
index 0000000..6a9e8a3
--- /dev/null
+++ b/tokenizer.model
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6f5af87706706ff930034b468c7f315c7da31de5f35d5b71a6458329ef5d9034
+size 2224601
diff --git a/tokenizer_config.json b/tokenizer_config.json
new file mode 100644
index 0000000..c6cd6aa
--- /dev/null
+++ b/tokenizer_config.json
@@ -0,0 +1,389 @@
+{
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "51": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "52": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "53": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "54": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "55": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "56": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "57": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "58": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "59": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "60": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "61": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "62": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "63": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "64": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "65": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "66": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "67": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "68": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "69": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "70": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "71": {
+      "content": "<B_SYS>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "72": {
+      "content": "<B_USYS>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73": {
+      "content": "<C_Q>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "74": {
+      "content": "<C_A>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "75": {
+      "content": "<B_FUNC>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "76": {
+      "content": "<B_CODE>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "77": {
+      "content": "<B_APE>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "78": {
+      "content": "<function_calling>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "79": {
+      "content": "<calc_start>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "80": {
+      "content": "<calc_end>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "81": {
+      "content": "<inner_think>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "82": {
+      "content": "<|im_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "83": {
+      "content": "<|tool_call|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "84": {
+      "content": "<|arguments|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "85": {
+      "content": "<|o1_step|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "86": {
+      "content": "<|o1_answer|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "87": {
+      "content": "<tree_node>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "88": {
+      "content": "</tree_node>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<B_SYS>",
+    "<B_USYS>",
+    "<C_Q>",
+    "<C_A>",
+    "<|im_sep|>",
+    "<|tool_call|>",
+    "<|arguments|>"
+  ],
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_baichuan.BaichuanTokenizer",
+      null
+    ]
+  },
+  "bos_token": "<s>",
+  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{'<B_SYS>' + message['content']}}{% elif message['role'] == 'user_system' %}{{'<B_USYS>' + message['content']}}{% elif message['role'] == 'user' %}{{'<C_Q>' + message['content']}}{% elif message['role'] == 'assistant' %}{{'<C_A>' + message['content']}}{% elif message['role'] == 'function' %}{{'<B_FUNC>' + message['content']}}{% elif message['role'] == 'code' %}{{'<B_CODE>' + message['content']}}{% else %}{{ raise_exception('Invalid message role: ' + message['role']) }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<C_A>'}}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "model_max_length": 32768,
+  "pad_token": "<pad>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "BaichuanTokenizer",
+  "unk_token": "<unk>",
+  "use_fast": false
+}
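
The `chat_template` above concatenates each message as a role marker followed by its content, with no separators, and appends `<C_A>` when `add_generation_prompt` is set. A minimal plain-Python sketch of that branching follows. The `ROLE_MARKERS` table and the `render_chat` helper are hypothetical names introduced for illustration and are not part of `tokenization_baichuan`. The sketch only mirrors the string the template would produce; it does not tokenize anything.

```python
# Role -> marker mapping, copied from the chat_template string in the config.
ROLE_MARKERS = {
    "system": "<B_SYS>",
    "user_system": "<B_USYS>",
    "user": "<C_Q>",
    "assistant": "<C_A>",
    "function": "<B_FUNC>",
    "code": "<B_CODE>",
}


def render_chat(messages, add_generation_prompt=False):
    """Hypothetical helper mirroring the Jinja template's branching."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_MARKERS:
            # The template calls raise_exception for any other role.
            raise ValueError("Invalid message role: " + role)
        # Marker and content are joined directly, with no whitespace.
        parts.append(ROLE_MARKERS[role] + message["content"])
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<C_A>")
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is hypertension?"},
    ],
    add_generation_prompt=True,
)
print(prompt)
# <B_SYS>You are a helpful assistant.<C_Q>What is hypertension?<C_A>
```

In practice the same string would normally come from the tokenizer's own `apply_chat_template` method; the sketch is only meant to make the role-to-marker mapping easy to read.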

| Category | Benchmark | Baichuan-M1-14B-Instruct | Qwen2.5-14B-Instruct | Qwen2.5-72B-Instruct | claude-3.5-sonnet-20241022 | gpt-4o |
|---|---|---|---|---|---|---|
| Average Score | | 72.23 | 65.39 | 70.51 | 74.85 | 75.00 |
| Clinical Practice | cmbclin | 77.40 | 71.51 | 75.36 | 78.37 | 75.36 |
| | clinicalbench_diag | 70.90 | 68.85 | 72.23 | 75.00 | 73.05 |
| | clinicalbench_hos | 70.05 | 68.83 | 70.53 | 65.58 | 69.38 |
| | clinicalbench_treat | 56.38 | 55.03 | 57.30 | 64.03 | 59.35 |
| | rarearena_rdc | 81.80 | 66.40 | 76.20 | 89.60 | 88.40 |
| | rarearena_rds | 54.00 | 42.60 | 49.80 | 59.80 | 57.20 |
| | rarebench | 59.60 | 52.80 | 60.60 | 65.30 | 62.80 |
| Exams | cmexam | 80.10 | 77.70 | 82.70 | 77.50 | 78.00 |
| | Pediatric Qualification Exam | 78.48 | 74.68 | 84.81 | 76.58 | 78.48 |
| | Internal Medicine Qualification Exam | 83.42 | 86.10 | 87.17 | 87.70 | 83.42 |
| | General Practice Qualification Exam | 87.07 | 88.44 | 88.44 | 81.63 | 84.35 |
| | USMLE | 78.00 | 67.20 | 76.70 | 85.90 | 87.10 |
| | medbullets | 66.88 | 54.22 | 64.29 | 72.40 | 75.97 |
| | mediq | 83.40 | 66.80 | 79.90 | 88.80 | 90.20 |
| | nejmqa | 49.75 | 45.69 | 50.76 | 69.54 | 54.31 |
| | pubmedqa | 75.20 | 76.40 | 75.60 | 77.00 | 77.60 |
| | redisqa | 74.50 | 69.70 | 75.00 | 83.20 | 82.80 |
| Basic Capabilities | mednli_dis | 80.40 | 68.90 | 74.90 | 58.30 | 79.80 |
| | medcalc | 56.00 | 31.40 | 37.90 | 52.60 | 49.00 |
| | MMLU-anatomy | 80.00 | 67.41 | 71.11 | 86.67 | 91.11 |
| | MMLU-virology | 54.82 | 56.02 | 53.01 | 54.22 | 57.23 |
| | MMLU-genetics | 91.00 | 82.00 | 87.00 | 97.00 | 95.00 |